Datasets

class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')

Bases: object

A class used to encode a drug repurposing dataset (items are drugs, users are diseases)

…

Parameters

ratings (array-like of shape (n_items, n_users)): an array which contains values in {-1, 0, 1, np.nan} describing, respectively, the negative, unlabelled, positive, and unavailable user-item matchings
items (array-like of shape (n_item_features, n_items)): an array which contains the item feature vectors
users (array-like of shape (n_user_features, n_users)): an array which contains the user feature vectors
same_item_user_features (bool, default: False): whether the item and user features are the same (optional)
name (str): name of the dataset (optional)

Attributes

name (str): name of the dataset
ratings (COO-array of shape (n_items, n_users)): an array which contains values in {-1, 0, 1} describing, respectively, the negative, unlabelled/unavailable, and positive user-item matchings
folds (COO-array of shape (n_items, n_users)): an array which contains values in {0, 1} describing, respectively, the unavailable and available user-item matchings in ratings
items (COO-array of shape (n_item_features, n_items)): an array which contains the item feature vectors (NaN for missing features)
users (COO-array of shape (n_user_features, n_users)): an array which contains the user feature vectors (NaN for missing features)
item_list (list of str): a list of the item names in the order of row indices in ratings
user_list (list of str): a list of the user names in the order of column indices in ratings
item_features (list of str): a list of the item feature names in the order of row indices in items
user_features (list of str): a list of the user feature names in the order of row indices in users
same_item_user_features (bool): whether the item and user features are the same
nusers (int): number of users
nitems (int): number of items
nuser_features (int): number of user features
nitem_features (int): number of item features
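
The relationship between the ratings parameter (values in {-1, 0, 1, np.nan}) and the ratings and folds attributes can be illustrated with a small NumPy sketch. This is not the library's implementation: it uses dense arrays instead of COO arrays, and the variable names raw, ratings and folds are chosen here for illustration only.

```python
import numpy as np

# Hypothetical input matrix: rows are items, columns are users.
raw = np.array([
    [1.0, 0.0, np.nan],
    [-1.0, np.nan, 1.0],
])

# folds marks the available matchings: 1 where a value was given, 0 where it was NaN.
folds = (~np.isnan(raw)).astype(int)

# In ratings, NaN collapses to 0, so 0 covers both unlabelled and unavailable pairs.
ratings = np.where(np.isnan(raw), 0, raw).astype(int)

print(ratings)
print(folds)
```

Under this reading, folds is what distinguishes a truly unlabelled pair (0 in ratings, 1 in folds) from an unavailable one (0 in both).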

Methods

__init__(ratings=None, users=None, items=None, same_item_user_features=False, name="dataset"): Initializes the Dataset object and creates all attributes
summary(sep="-"*70): Prints out the characteristics of the drug repurposing dataset
visualize(withzeros=False, X=None, y=None, metric="euclidean", figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False): Plots datapoints in the dataset annotated by the ground truth or predicted ratings
subset(folds, subset_name="subset"): Creates a subset of the dataset based on the folds given as input

subset(folds, subset_name='subset')

Obtains a subset of a stanscofi.Dataset based on a set of user and item indices

…

Parameters

folds (COO-array of shape (n_items, n_users)): an array which contains values in {0, 1} describing, respectively, the unavailable and available user-item matchings in ratings
subset_name (str): name of the newly created stanscofi.Dataset

Returns

subset (stanscofi.Dataset): dataset corresponding to the folds in input
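
One plausible reading of how a folds mask selects matchings is sketched below with dense NumPy arrays. It assumes that matchings outside the folds are simply treated as unavailable (set to 0); whether the library also drops rows, columns or features is not stated here, and the names ratings, folds and subset_ratings are illustrative.

```python
import numpy as np

# Hypothetical ratings matrix: rows are items, columns are users.
ratings = np.array([[1, 0, -1],
                    [-1, 1, 0]])

# Hypothetical folds: keep only the matchings involving the first two users.
folds = np.zeros_like(ratings)
folds[:, :2] = 1

# Assumption: matchings with a 0 in folds become unavailable (0) in the subset.
subset_ratings = np.where(folds == 1, ratings, 0)

print(subset_ratings)
```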

summary(sep='----------------------------------------------------------------------')

Prints out a summary of the contents of a stanscofi.Dataset: the number of items, users, the number of positive, negative, unlabeled, unavailable matchings, the sparsity number, and the shape and percentage of missing values in the item and user feature matrices

…

Parameters

sep (str): separator for pretty printing

…

Returns

ndrugs (int): number of drugs
ndiseases (int): number of diseases
ndrugs_known (int): number of drugs with at least one known (positive or negative) rating
ndiseases_known (int): number of diseases with at least one known (positive or negative) rating
npositive (int): number of positive ratings
nnegative (int): number of negative ratings
nunlabeled_unavailable (int): number of unlabeled or unavailable ratings
nunavailable (int): number of unavailable ratings
sparsity (float): percentage of known ratings
sparsity_known (float): percentage of known ratings among drugs and diseases with at least one known rating
ndrug_features (int): number of drug features
missing_drug_features (float): percentage of missing drug feature values
ndisease_features (int): number of disease features
missing_disease_features (float): percentage of missing disease feature values
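
Several of the returned counts follow directly from the ratings matrix. The sketch below recomputes a few of them with NumPy from the definitions above; it is not the library's code, and it assumes percentages are expressed on a 0 to 100 scale, which is not specified here.

```python
import numpy as np

# Hypothetical ratings matrix: rows are drugs (items), columns are diseases (users).
ratings = np.array([[1, 0, -1, 0],
                    [0, 0, 0, 0],
                    [-1, 1, 0, 0]])

npositive = int((ratings == 1).sum())
nnegative = int((ratings == -1).sum())

# A rating is "known" when it is positive or negative.
known = ratings != 0
ndrugs_known = int(known.any(axis=1).sum())
ndiseases_known = int(known.any(axis=0).sum())

# Assumption: percentages are on a 0-100 scale.
sparsity = 100.0 * known.sum() / ratings.size
sparsity_known = 100.0 * known.sum() / (ndrugs_known * ndiseases_known)

print(npositive, nnegative, ndrugs_known, ndiseases_known)
print(round(sparsity, 2), round(sparsity_known, 2))
```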

visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots a representation of the datapoints in a stanscofi.Dataset which is annotated either by the ground truth labels or the predicted labels. The representation is the plot of the datapoints according to the first two Principal Components, or the first two dimensions in UMAP, if the feature matrices can be converted into a (n_ratings, n_features) shaped matrix where n_features>1, else it plots a heatmap with the values in the matrix for each rating pair.
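
The projection onto the first two Principal Components can be sketched as follows. This is only an illustration under stated assumptions: the pair feature vector is taken to be the item vector concatenated with the user vector (the library may build it differently), and PCA is computed by hand with an SVD rather than with whatever routine the library calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrices, shaped (n_features, n_entities) as in the Dataset class.
n_item_features, n_items = 3, 4
n_user_features, n_users = 2, 5
items = rng.normal(size=(n_item_features, n_items))
users = rng.normal(size=(n_user_features, n_users))

# Assumption: one row per (item, user) pair, item features followed by user features.
pairs = [(i, u) for i in range(n_items) for u in range(n_users)]
X = np.array([np.concatenate([items[:, i], users[:, u]]) for i, u in pairs])

# PCA via SVD of the centered matrix; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # 2D coordinates, shape (n_ratings, 2)

print(X.shape, coords.shape)
```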

In the legend, ground truth labels are denoted with brackets: e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by “pos” (positive) and “neg” (negative); correct (resp., incorrect) predictions are denoted by “correct”, resp. “error”

…

Parameters

withzeros (bool): boolean to assess whether (user, item) unknown matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted
X (array-like of shape (n_ratings, n_features) or None): (item, user) pair feature matrix
y (array-like of shape (n_ratings, ) or None): response vector for each (item, user) pair in X; necessarily X should not be None if y is not None, and vice versa; setting X and y automatically overrides the other behaviors of this function
metric (str): metric to consider to perform hierarchical clustering on the dataset. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
figsize (tuple of size 2): width and height of the figure
fontsize (int): size of the legend, title and labels of the figure
dimred_args (dict): dictionary which lists the parameters to the dimensionality reduction method (either PCA, by default, or UMAP, if parameter "n_neighbors" is provided)
predictions (array-like of shape (n_ratings, 3) or None): a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints, otherwise, the predicted ratings will be used
use_ratings (bool): if set to True, use the ratings in the dataset as predictions (for debugging purposes)
random_state (int): random seed
show_errors (bool): boolean to assess whether to color according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to the identity between the ground truth and the predicted labels for each (user, item) pair
verbose (bool): prints out information at each step
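
The three-column layout expected for predictions can be built from a dense matrix of predicted classes, as in the sketch below. The matrix predicted is a made-up example, and the use of 0-based indices is an assumption; only the column order (user index, item index, class) comes from the parameter description.

```python
import numpy as np

# Hypothetical predicted classes: rows are items, columns are users.
predicted = np.array([[1, -1],
                      [0, 1],
                      [-1, 1]])

# Row (item) and column (user) index of every entry of the matrix.
item_idx, user_idx = np.indices(predicted.shape)

predictions = np.column_stack([
    user_idx.ravel(),   # column 1: user indices (assumed 0-based)
    item_idx.ravel(),   # column 2: item indices (assumed 0-based)
    predicted.ravel(),  # column 3: class in {-1, 0, 1}
])

print(predictions)
```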

stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)

Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.

Each of the nfeatures features for (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those for negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated, which are the concatenation of npositive positive and nnegative negative pair feature vectors generated from Gaussian distributions. Thus there are npositive^2 positive ratings (each "positive" user with a "positive" item), nnegative^2 negative ratings (idem), and the remainder is unknown (that is, (npositive+nnegative)^2-npositive^2-nnegative^2 ratings).

…

Parameters

npositive (int): number of positive items/users
nnegative (int): number of negative items/users
nfeatures (int): number of item/user features
mean (float): mean of generating Gaussian distributions
std (float): standard deviation of generating Gaussian distributions
random_state (int, default: 12454): random seed

Returns

ratings (array-like of shape (n_items, n_users)): a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings
users (array-like of shape (n_user_features, n_users)): an array which contains the user feature vectors
items (array-like of shape (n_item_features, n_items)): an array which contains the item feature vectors
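
The generation scheme described above can be reproduced in a few lines of NumPy. The function sketch_dummy_dataset below is a hypothetical stand-in written from the description, not the library's implementation: it returns plain arrays, and its random draws will not match those of generate_dummy_dataset for the same seed.

```python
import numpy as np

def sketch_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454):
    """Hypothetical re-implementation of the scheme described in the text."""
    rng = np.random.default_rng(random_state)
    half = nfeatures // 2

    def draw():
        # "Positive" columns centered on +mean, "negative" columns on -mean.
        pos = rng.normal(mean, std, size=(half, npositive))
        neg = rng.normal(-mean, std, size=(half, nnegative))
        return np.concatenate([pos, neg], axis=1)

    items, users = draw(), draw()

    n = npositive + nnegative
    ratings = np.zeros((n, n), dtype=int)     # unknown by default
    ratings[:npositive, :npositive] = 1       # positive item with positive user
    ratings[npositive:, npositive:] = -1      # negative item with negative user
    return ratings, users, items

ratings, users, items = sketch_dummy_dataset(3, 2, 6, mean=2.0, std=0.1)
print(ratings)
print(users.shape, items.shape)
```

With npositive=3 and nnegative=2, the counts match the formula in the description: 9 positive ratings, 4 negative ratings, and 25-9-4=12 unknown ones.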