Datasets

class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')

Bases: object

A class used to encode a drug repurposing dataset (items are drugs, users are diseases)

Parameters

ratings : array-like of shape (n_items, n_users)

an array which contains values in {-1, 0, 1, np.nan} describing the negative, unlabelled, positive and unavailable user-item matchings, respectively

items : array-like of shape (n_item_features, n_items)

an array which contains the item feature vectors

users : array-like of shape (n_user_features, n_users)

an array which contains the user feature vectors

same_item_user_features : bool (default: False)

whether the item and user features are the same (optional)

name : str

name of the dataset (optional)

Attributes

name : str

name of the dataset

ratings : COO-array of shape (n_items, n_users)

an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable and positive user-item matchings, respectively

folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings

items : COO-array of shape (n_item_features, n_items)

an array which contains the item feature vectors (NaN for missing features)

users : COO-array of shape (n_user_features, n_users)

an array which contains the user feature vectors (NaN for missing features)

item_list : list of str

a list of the item names in the order of row indices in ratings

user_list : list of str

a list of the user names in the order of column indices in ratings

item_features : list of str

a list of the item feature names in the order of row indices in items

user_features : list of str

a list of the user feature names in the order of row indices in users

same_item_user_features : bool

whether the item and user features are the same

nusers : int

number of users

nitems : int

number of items

nuser_features : int

number of user features

nitem_features : int

number of item features
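
As a rough illustration of how the ratings parameter relates to the ratings and folds attributes described above, the sketch below splits a small matrix with values in {-1, 0, 1, NaN} into a stored ratings matrix (NaN mapped to 0) and a 0/1 availability mask. It uses plain Python lists instead of COO arrays to stay dependency-free, and the helper name split_ratings is hypothetical; the conversion actually performed by the library may differ.

```python
import math

def split_ratings(raw):
    """Split a raw ratings matrix into (ratings, folds).

    raw: list of rows (items) x columns (users), values in {-1, 0, 1, nan}.
    ratings: same shape, NaN replaced by 0 (unlabelled/unavailable).
    folds: same shape, 1 where the entry was available, 0 where it was NaN.
    """
    ratings = [[0 if math.isnan(v) else int(v) for v in row] for row in raw]
    folds = [[0 if math.isnan(v) else 1 for v in row] for row in raw]
    return ratings, folds

raw = [
    [1.0, 0.0, math.nan],   # item 0
    [-1.0, math.nan, 1.0],  # item 1
]
ratings, folds = split_ratings(raw)
print(ratings)  # [[1, 0, 0], [-1, 0, 1]]
print(folds)    # [[1, 1, 0], [1, 0, 1]]
```

Note that an unlabelled entry (0) and an unavailable entry (NaN) both end up as 0 in ratings; only folds tells them apart.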

Methods

__init__(ratings=None, users=None, items=None, same_item_user_features=False, name="dataset")

Initializes the Dataset object and creates all attributes

summary(sep="-"*70)

Prints out the characteristics of the drug repurposing dataset

visualize(withzeros=False, X=None, y=None, metric="euclidean", figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots datapoints in the dataset annotated by the ground truth or predicted ratings

subset(folds, subset_name="subset")

Creates a subset of the dataset based on the folds given as input

subset(folds, subset_name='subset')

Obtains a subset of a stanscofi.Dataset restricted to the user-item matchings selected by folds

Parameters

folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings

subset_name : str

name of the newly created stanscofi.Dataset

Returns

subset : stanscofi.Dataset

dataset corresponding to the folds given as input
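
The folds argument is a 0/1 matrix with the same shape as ratings. A minimal sketch of building such a mask from a list of selected (item, user) index pairs is given below, using coordinate-format triples (row indices, column indices, values) as a stand-in for a COO array. The helpers folds_from_pairs and to_dense are hypothetical and not part of stanscofi.

```python
def folds_from_pairs(pairs, n_items, n_users):
    """Build coordinate-format (rows, cols, data) triples for a 0/1 mask.

    pairs: iterable of (item_index, user_index) entries to keep.
    Duplicates are dropped and out-of-range indices raise ValueError.
    """
    rows, cols, data = [], [], []
    for i, j in sorted(set(pairs)):
        if not (0 <= i < n_items and 0 <= j < n_users):
            raise ValueError(f"pair ({i}, {j}) outside shape ({n_items}, {n_users})")
        rows.append(i)
        cols.append(j)
        data.append(1)
    return rows, cols, data

def to_dense(rows, cols, data, n_items, n_users):
    """Expand the triples into a dense list-of-lists mask (for inspection)."""
    dense = [[0] * n_users for _ in range(n_items)]
    for i, j, v in zip(rows, cols, data):
        dense[i][j] = v
    return dense

rows, cols, data = folds_from_pairs([(0, 1), (2, 0), (0, 1)], n_items=3, n_users=2)
mask = to_dense(rows, cols, data, n_items=3, n_users=2)
print(mask)  # [[0, 1], [0, 0], [1, 0]]
```

With SciPy available, the same triples could be turned into a sparse mask with scipy.sparse.coo_array((data, (rows, cols)), shape=(n_items, n_users)) before calling subset.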

summary(sep='----------------------------------------------------------------------')

Prints out a summary of the contents of a stanscofi.Dataset: the numbers of items and users; the numbers of positive, negative, unlabeled and unavailable matchings; the sparsity; and the shape and percentage of missing values of the item and user feature matrices

Parameters

sep : str

separator for pretty printing

Returns

ndrugs : int

number of drugs

ndiseases : int

number of diseases

ndrugs_known : int

number of drugs with at least one known (positive or negative) rating

ndiseases_known : int

number of diseases with at least one known (positive or negative) rating

npositive : int

number of positive ratings

nnegative : int

number of negative ratings

nunlabeled_unavailable : int

number of unlabeled or unavailable ratings

nunavailable : int

number of unavailable ratings

sparsity : float

percentage of known ratings

sparsity_known : float

percentage of known ratings among drugs and diseases with at least one known rating

ndrug_features : int

number of drug features

missing_drug_features : float

percentage of missing drug feature values

ndisease_features : int

number of disease features

missing_disease_features : float

percentage of missing disease feature values
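
To make some of the returned quantities concrete, here is a small sketch that computes the rating counts and the two sparsity values from a dense ratings matrix. It is not the library's implementation: the helper rating_stats is hypothetical, plain lists stand in for the COO array, and the percentages are assumed to be 100 times the number of known ratings divided by the number of entries considered.

```python
def rating_stats(ratings):
    """Compute a few summary counts from a dense ratings matrix.

    ratings: list of rows (items/drugs) x columns (users/diseases),
    values in {-1, 0, 1}, where 0 means unlabelled or unavailable.
    """
    n_items = len(ratings)
    n_users = len(ratings[0]) if ratings else 0
    npositive = sum(v == 1 for row in ratings for v in row)
    nnegative = sum(v == -1 for row in ratings for v in row)
    # Items (rows) and users (columns) with at least one known rating.
    items_known = sum(any(v != 0 for v in row) for row in ratings)
    users_known = sum(
        any(ratings[i][j] != 0 for i in range(n_items)) for j in range(n_users)
    )
    known = npositive + nnegative
    total = n_items * n_users
    total_known = items_known * users_known
    return {
        "npositive": npositive,
        "nnegative": nnegative,
        "ndrugs_known": items_known,
        "ndiseases_known": users_known,
        # Assumption: percentages are 100 * known / number of entries.
        "sparsity": 100.0 * known / total if total else 0.0,
        "sparsity_known": 100.0 * known / total_known if total_known else 0.0,
    }

stats = rating_stats([
    [1, 0, 0, 0],
    [-1, 1, 0, 0],
    [0, 0, 0, 0],
])
print(stats["sparsity"], stats["sparsity_known"])  # 25.0 75.0
```

In this example, 3 of the 12 entries are known (25%), but restricting to the 2 drugs and 2 diseases that have at least one known rating gives 3 of 4 entries (75%).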

visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots a representation of the datapoints in a stanscofi.Dataset, annotated by either the ground truth labels or the predicted labels. If the feature matrices can be converted into a matrix of shape (n_ratings, n_features) with n_features>1, the datapoints are plotted along the first two Principal Components, or along the first two UMAP dimensions. Otherwise, a heatmap of the values in the matrix is plotted for each rating pair.

In the legend, ground truth labels are denoted with brackets, e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by "pos" (positive) and "neg" (negative); correct (resp., incorrect) predictions are denoted by "correct" (resp., "error").

Parameters

withzeros : bool

whether unknown (item, user) matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted

X : array-like of shape (n_ratings, n_features) or None

(item, user) pair feature matrix

y : array-like of shape (n_ratings,) or None

response vector for each (item, user) pair in X; X and y must either both be provided or both be None; setting X and y automatically overrides the other behaviors of this function

metric : str

metric used to perform hierarchical clustering on the dataset. Should be one of ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']

figsize : tuple of size 2

width and height of the figure

fontsize : int

size of the legend, title and labels of the figure

dimred_args : dict

dictionary which lists the parameters of the dimensionality reduction method (PCA by default, or UMAP if the parameter "n_neighbors" is provided)

predictions : array-like of shape (n_ratings, 3) or None

a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints; otherwise, the predicted ratings will be used

use_ratings : bool

if set to True, use the ratings in the dataset as predictions (for debugging purposes)

random_state : int

random seed

show_errors : bool

whether to color datapoints according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to whether the ground truth and the predicted labels agree for each (user, item) pair

verbose : bool

prints out information at each step
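
The layout of predictions and the comparison described for show_errors can be sketched as follows. This is only an illustration under stated assumptions: the helper label_errors is hypothetical, indices are taken to be 0-based, and a dense list of lists stands in for the ratings array. Note that predictions lists the user index first, whereas ratings is indexed by item first.

```python
def label_errors(ratings, predictions):
    """Tag each predicted pair with a known ground truth as correct or error.

    ratings: dense list-of-lists of shape (n_items, n_users), values in {-1, 0, 1}.
    predictions: rows of (user_index, item_index, predicted_class).
    Pairs whose ground truth is 0 (unknown) are skipped.
    """
    tags = {}
    for user, item, predicted in predictions:
        truth = ratings[item][user]  # ratings is indexed [item][user]
        if truth == 0:
            continue
        tags[(user, item)] = "correct" if predicted == truth else "error"
    return tags

ratings = [
    [1, 0],   # item 0
    [-1, 1],  # item 1
]
predictions = [
    (0, 0, 1),   # user 0, item 0: truth 1  -> correct
    (0, 1, 1),   # user 0, item 1: truth -1 -> error
    (1, 0, -1),  # user 1, item 0: truth 0  -> skipped (unknown)
    (1, 1, 1),   # user 1, item 1: truth 1  -> correct
]
tags = label_errors(ratings, predictions)
print(tags)  # {(0, 0): 'correct', (0, 1): 'error', (1, 1): 'correct'}
```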

stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)

Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.

Each of the nfeatures features of the (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those associated with negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated by concatenating npositive positive and nnegative negative feature vectors drawn from these Gaussian distributions. Thus there are npositive^2 positive ratings (each "positive" user with a "positive" item), nnegative^2 negative ratings (likewise), and the remainder is unknown, that is, (npositive+nnegative)^2-npositive^2-nnegative^2 = 2*npositive*nnegative ratings.

Parameters

npositive : int

number of positive items/users

nnegative : int

number of negative items/users

nfeatures : int

number of item/user features

mean : float

mean of generating Gaussian distributions

std : float

standard deviation of generating Gaussian distributions

random_state : int

random seed

Returns

ratings : array-like of shape (n_items, n_users)

a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings

users : array-like of shape (n_user_features, n_users)

an array which contains the user feature vectors

items : array-like of shape (n_item_features, n_items)

an array which contains the item feature vectors
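
The generation scheme described for generate_dummy_dataset can be sketched in pure Python as below. This is a stand-in written from the description, not the library's code: the function name dummy_dataset, the seed argument, the dictionary return value, and the choice to place the "positive" items/users first are all assumptions made for illustration.

```python
import random

def dummy_dataset(npositive, nnegative, nfeatures, mean, std, seed=0):
    """Pure-Python stand-in for the generation scheme described above."""
    rng = random.Random(seed)
    n = npositive + nnegative
    half = nfeatures // 2
    # The first npositive columns are "positive" (mean +mean), the rest "negative".
    centers = [mean] * npositive + [-mean] * nnegative

    def features():
        # Matrix of shape (half, n): one column per item (or user).
        return [[rng.gauss(centers[j], std) for j in range(n)] for _ in range(half)]

    items, users = features(), features()
    # Positive item with positive user -> 1, negative with negative -> -1, else 0.
    ratings = [
        [
            1 if (i < npositive and j < npositive)
            else -1 if (i >= npositive and j >= npositive)
            else 0
            for j in range(n)
        ]
        for i in range(n)
    ]
    return {"ratings": ratings, "users": users, "items": items}

data = dummy_dataset(npositive=3, nnegative=2, nfeatures=6, mean=0.5, std=1.0)
flat = [v for row in data["ratings"] for v in row]
print(flat.count(1), flat.count(-1), flat.count(0))  # 9 4 12
```

With npositive=3 and nnegative=2, this gives 3^2 = 9 positive ratings, 2^2 = 4 negative ratings and 2*3*2 = 12 unknown ratings out of 25, and feature matrices of shape (3, 5).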