Datasets
- class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')
Bases: object
A class used to encode a drug repurposing dataset (items are drugs, users are diseases)
…
Parameters
- ratings : array-like of shape (n_items, n_users)
an array which contains values in {-1, 0, 1, np.nan} describing the negative, unlabelled, positive, unavailable user-item matchings
- items : array-like of shape (n_item_features, n_items)
an array which contains the item feature vectors
- users : array-like of shape (n_user_features, n_users)
an array which contains the user feature vectors
- same_item_user_features : bool (default: False)
whether the item and user features are the same (optional)
- name : str
name of the dataset (optional)
Attributes
- name : str
name of the dataset
- ratings : COO-array of shape (n_items, n_users)
an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
- folds : COO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
- items : COO-array of shape (n_item_features, n_items)
an array which contains the item feature vectors (NaN for missing features)
- users : COO-array of shape (n_user_features, n_users)
an array which contains the user feature vectors (NaN for missing features)
- item_list : list of str
a list of the item names in the order of row indices in ratings
- user_list : list of str
a list of the user names in the order of column indices in ratings
- item_features : list of str
a list of the item feature names in the order of row indices in items
- user_features : list of str
a list of the user feature names in the order of row indices in users
- same_item_user_features : bool
whether the item and user features are the same
- nusers : int
number of users
- nitems : int
number of items
- nuser_features : int
number of user features
- nitem_features : int
number of item features
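The relationship between the ratings argument and the ratings and folds attributes can be illustrated without the library. The sketch below is a dependency-free approximation, not stanscofi's implementation: the helper name split_ratings is hypothetical, plain nested lists stand in for COO arrays, and it assumes that NaN entries are exactly the unavailable matchings, stored as 0 in ratings and flagged with 0 in folds.

```python
import math


def split_ratings(raw):
    """Split a raw rating matrix into (ratings, folds) nested lists.

    raw: rows are items, columns are users; values in {-1, 0, 1, nan}.
    Hypothetical helper that only illustrates the documented encoding.
    """
    ratings, folds = [], []
    for row in raw:
        # Assumption: NaN marks an unavailable matching, stored as 0 in ratings
        ratings.append([0 if math.isnan(v) else int(v) for v in row])
        # folds flags available (1) versus unavailable (0) matchings
        folds.append([0 if math.isnan(v) else 1 for v in row])
    return ratings, folds


nan = float("nan")
raw = [
    [1, 0, nan],   # item 0 against users 0, 1, 2
    [-1, nan, 1],  # item 1 against users 0, 1, 2
]
ratings, folds = split_ratings(raw)
print(ratings)
print(folds)
```

Under this reading, an unlabelled pair and an unavailable pair both appear as 0 in ratings, and only folds tells them apart.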
Methods
- __init__(ratings=None, users=None, items=None, same_item_user_features=False, name="dataset")
Initializes the Dataset object and creates all attributes
- summary(sep="-"*70)
Prints out the characteristics of the drug repurposing dataset
- visualize(withzeros=False, X=None, y=None, metric="euclidean", figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)
Plots datapoints in the dataset annotated by the ground truth or predicted ratings
- subset(folds, subset_name="subset")
Creates a subset of the dataset based on the folds given as input
- subset(folds, subset_name='subset')
Obtains a subset of a stanscofi.Dataset based on a set of user and item indices
…
Parameters
- folds : COO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
- subset_name : str
name of the newly created stanscofi.Dataset
Returns
- subset : stanscofi.Dataset
dataset corresponding to the folds in input
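One plausible reading of the folds argument is a binary mask over the rating matrix. The sketch below only illustrates that reading with nested lists and makes no claim about how subset is implemented: the helper mask_ratings is hypothetical, and it assumes that pairs outside the fold are set to 0 while the matrix shape is preserved.

```python
def mask_ratings(ratings, folds):
    """Keep ratings where folds == 1; other entries become 0.

    Hypothetical helper: ratings and folds are nested lists of the same
    shape (rows are items, columns are users).
    """
    return [
        [r if f == 1 else 0 for r, f in zip(r_row, f_row)]
        for r_row, f_row in zip(ratings, folds)
    ]


ratings = [[1, -1, 0], [0, 1, -1]]
train_folds = [[1, 0, 1], [0, 1, 0]]
test_folds = [[0, 1, 0], [1, 0, 1]]  # complement of train_folds

train = mask_ratings(ratings, train_folds)
test = mask_ratings(ratings, test_folds)
print(train)
print(test)
```

Because the two masks are complementary, every rating lands in exactly one of the two subsets.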
- summary(sep='----------------------------------------------------------------------')
Prints out a summary of the contents of a stanscofi.Dataset: the number of items, users, the number of positive, negative, unlabeled, unavailable matchings, the sparsity number, and the shape and percentage of missing values in the item and user feature matrices
…
Parameters
- sep : str
separator for pretty printing
…
Returns
- ndrugs : int
number of drugs
- ndiseases : int
number of diseases
- ndrugs_known : int
number of drugs with at least one known (positive or negative) rating
- ndiseases_known : int
number of diseases with at least one known (positive or negative) rating
- npositive : int
number of positive ratings
- nnegative : int
number of negative ratings
- nunlabeled_unavailable : int
number of unlabeled or unavailable ratings
- nunavailable : int
number of unavailable ratings
- sparsity : float
percentage of known ratings
- sparsity_known : float
percentage of known ratings among drugs and diseases with at least one known rating
- ndrug_features : int
number of drug features
- missing_drug_features : float
percentage of missing drug feature values
- ndisease_features : int
number of disease features
- missing_disease_features : float
percentage of missing disease feature values
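Several of the returned values follow directly from the rating matrix. The sketch below recomputes a few of them on a toy matrix. It is an illustration under stated assumptions, not the library's code: the helper rating_statistics is hypothetical, the matrix is a nested list with drugs as rows and diseases as columns, and percentages are expressed on a 0 to 100 scale (the scale actually used by summary is not stated above).

```python
def rating_statistics(ratings):
    """Recompute a few documented summary values (hypothetical helper)."""
    ndrugs = len(ratings)
    ndiseases = len(ratings[0])
    npositive = sum(v == 1 for row in ratings for v in row)
    nnegative = sum(v == -1 for row in ratings for v in row)
    # A drug (row) or disease (column) is "known" if it has a nonzero rating
    ndrugs_known = sum(any(v != 0 for v in row) for row in ratings)
    ndiseases_known = sum(
        any(row[j] != 0 for row in ratings) for j in range(ndiseases)
    )
    nknown = npositive + nnegative
    sparsity = 100.0 * nknown / (ndrugs * ndiseases)
    # Restrict the denominator to rows and columns with a known rating
    sparsity_known = (
        100.0 * nknown / (ndrugs_known * ndiseases_known) if nknown else 0.0
    )
    return {
        "ndrugs": ndrugs,
        "ndiseases": ndiseases,
        "ndrugs_known": ndrugs_known,
        "ndiseases_known": ndiseases_known,
        "npositive": npositive,
        "nnegative": nnegative,
        "sparsity": sparsity,
        "sparsity_known": sparsity_known,
    }


toy = [
    [1, 0, 0, 0],
    [-1, 1, 0, 0],
    [0, 0, 0, 0],  # drug without any known rating
]
stats = rating_statistics(toy)
print(stats)
```

On this toy matrix, 3 of the 12 pairs are known, and the same 3 ratings fill 3 of the 4 cells spanned by the two known drugs and two known diseases.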
- visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)
Plots a representation of the datapoints in a stanscofi.Dataset which is annotated either by the ground truth labels or the predicted labels. If the feature matrices can be converted into a matrix of shape (n_ratings, n_features) where n_features>1, the datapoints are plotted according to the first two Principal Components, or the first two dimensions in UMAP; otherwise, a heatmap is plotted with the values in the matrix for each rating pair.
In the legend, ground truth labels are denoted with brackets, e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by “pos” (positive) and “neg” (negative); correct and incorrect predictions are denoted by “correct” and “error”, respectively.
…
Parameters
- withzeros : bool
boolean to assess whether (user, item) unknown matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted
- X : array-like of shape (n_ratings, n_features) or None
(item, user) pair feature matrix
- y : array-like of shape (n_ratings, ) or None
response vector for each (item, user) pair in X; necessarily X should not be None if y is not None, and vice versa; setting X and y automatically overrides the other behaviors of this function
- metric : str
metric to consider to perform hierarchical clustering on the dataset. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
- figsize : tuple of size 2
width and height of the figure
- fontsize : int
size of the legend, title and labels of the figure
- dimred_args : dict
dictionary which lists the parameters to the dimensionality reduction method (either PCA, by default, or UMAP, if parameter "n_neighbors" is provided)
- predictions : array-like of shape (n_ratings, 3) or None
a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints, otherwise, the predicted ratings will be used
- use_ratings : bool
if set to True, use the ratings in the dataset as predictions (for debugging purposes)
- random_state : int
random seed
- show_errors : bool
boolean to assess whether to color according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to the identity between the ground truth and the predicted labels for each (user, item) pair
- verbose : bool
prints out information at each step
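The three-column layout expected for predictions, and the "correct"/"error" labelling applied when show_errors=True, can be sketched in plain Python. This is an illustration only: both helpers are hypothetical, indices are assumed to be zero-based, and tuples stand in for rows of the (n_ratings, 3) matrix.

```python
def ratings_to_triplets(ratings):
    """List (user_index, item_index, class) for every nonzero rating.

    ratings: nested list whose rows are items and whose columns are users.
    """
    return [
        (user, item, value)
        for item, row in enumerate(ratings)
        for user, value in enumerate(row)
        if value != 0
    ]


def label_errors(ratings, predictions):
    """Map each predicted (user, item) pair with a known ground truth
    to 'correct' or 'error'; pairs with unknown truth are skipped."""
    labels = {}
    for user, item, predicted in predictions:
        truth = ratings[item][user]
        if truth != 0:  # restrict to known matchings
            labels[(user, item)] = "correct" if truth == predicted else "error"
    return labels


ratings = [[1, -1], [0, 1]]
triplets = ratings_to_triplets(ratings)
# A predictor that calls every pair positive
predictions = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
labels = label_errors(ratings, predictions)
print(triplets)
print(labels)
```

Note the column order: the user index comes first and the item index second, even though ratings is indexed as (item, user).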
- stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)
Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.
Each of the nfeatures features of the (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those for negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated, each being the concatenation of npositive positive and nnegative negative feature vectors generated from these Gaussian distributions. Thus there are npositive^2 positive ratings (each “positive” user paired with each “positive” item), nnegative^2 negative ratings (likewise for “negative” users and items), and the remainder is unknown (that is, (npositive+nnegative)^2-npositive^2-nnegative^2 = 2*npositive*nnegative ratings).
…
Parameters
- npositive : int
number of positive items/users
- nnegative : int
number of negative items/users
- nfeatures : int
number of item/user features
- mean : float
mean of generating Gaussian distributions
- std : float
standard deviation of generating Gaussian distributions
- random_state : int (default: 12454)
random seed
Returns
- ratings : array-like of shape (n_items, n_users)
a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings
- users : array-like of shape (n_user_features, n_users)
an array which contains the user feature vectors
- items : array-like of shape (n_item_features, n_items)
an array which contains the item feature vectors
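The generation scheme described for generate_dummy_dataset can be mimicked with the standard library to check the rating counts. The function below is a hypothetical stand-in, not the library's code: it uses random.gauss instead of NumPy, returns nested lists, and assumes that the "positive" items and users come first in the index order, which the description does not specify.

```python
import random


def dummy_dataset_sketch(npositive, nnegative, nfeatures, mean, std, seed=0):
    """Hypothetical re-creation of the documented generation scheme."""
    rng = random.Random(seed)
    n = npositive + nnegative
    half = nfeatures // 2

    def feature_matrix():
        # Shape (half, n): the first npositive columns are drawn from
        # N(mean, std), the remaining nnegative columns from N(-mean, std)
        return [
            [rng.gauss(mean if col < npositive else -mean, std) for col in range(n)]
            for _ in range(half)
        ]

    users = feature_matrix()
    items = feature_matrix()
    ratings = [[0] * n for _ in range(n)]
    for item in range(n):
        for user in range(n):
            if item < npositive and user < npositive:
                ratings[item][user] = 1    # positive item with positive user
            elif item >= npositive and user >= npositive:
                ratings[item][user] = -1   # negative item with negative user
    return ratings, users, items


ratings, users, items = dummy_dataset_sketch(3, 2, 4, mean=2.0, std=0.1)
flat = [v for row in ratings for v in row]
counts = {label: flat.count(label) for label in (1, -1, 0)}
print(counts)
```

With npositive=3 and nnegative=2, the counts are 3^2 = 9 positive, 2^2 = 4 negative and 2*3*2 = 12 unknown ratings out of 25.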