stanscofi package
Submodules
stanscofi.datasets module
- class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')
Bases: object
A class used to encode a drug repurposing dataset (items are drugs, users are diseases)
…
Parameters
- ratingsarray-like of shape (n_items, n_users)
an array which contains values in {-1, 0, 1, np.nan} describing the negative, unlabelled, positive, unavailable user-item matchings
- itemsarray-like of shape (n_item_features, n_items)
an array which contains the item feature vectors
- usersarray-like of shape (n_user_features, n_users)
an array which contains the user feature vectors
- same_item_user_featuresbool (default: False)
whether the item and user features are the same (optional)
- namestr
name of the dataset (optional)
Attributes
- namestr
name of the dataset
- ratingsCOO-array of shape (n_items, n_users)
an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
- foldsCOO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
- itemsCOO-array of shape (n_item_features, n_items)
an array which contains the item feature vectors (NaN for missing features)
- usersCOO-array of shape (n_user_features, n_users)
an array which contains the user feature vectors (NaN for missing features)
- item_listlist of str
a list of the item names in the order of row indices in ratings
- user_listlist of str
a list of the user names in the order of column indices in ratings
- item_featureslist of str
a list of the item feature names in the order of row indices in items
- user_featureslist of str
a list of the user feature names in the order of row indices in users
- same_item_user_featuresbool
whether the item and user features are the same
- nusersint
number of users
- nitemsint
number of items
- nuser_featuresint
number of user features
- nitem_featuresint
number of item features
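The relationship between the ratings parameter and the ratings and folds attributes can be illustrated with a minimal, dependency-free sketch. It assumes that NaN entries (unavailable pairs) are stored as 0 in ratings and flagged as 0 in folds. Nested lists stand in for the COO arrays, and split_ratings is a hypothetical helper written for this example, not a stanscofi function.

```python
import math

def split_ratings(raw):
    """Split a raw matrix with values in {-1, 0, 1, nan} into (ratings, folds).

    Assumption: NaN (unavailable) becomes 0 in ratings and 0 in folds;
    every other entry is kept in ratings and flagged 1 in folds.
    """
    ratings, folds = [], []
    for row in raw:
        ratings.append([0 if math.isnan(v) else int(v) for v in row])
        folds.append([0 if math.isnan(v) else 1 for v in row])
    return ratings, folds

nan = float("nan")
raw = [[1, 0, nan],
       [-1, nan, 1]]  # 2 items x 3 users
ratings, folds = split_ratings(raw)
print(ratings)  # [[1, 0, 0], [-1, 0, 1]]
print(folds)    # [[1, 1, 0], [1, 0, 1]]
```

With this encoding, a 0 in ratings is ambiguous on its own (unlabelled or unavailable), which is why folds is kept alongside it.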
Methods
- __init__(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')
Initializes the Dataset object and creates all attributes
- summary(sep='-'*70)
Prints out the characteristics of the drug repurposing dataset
- visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)
Plots datapoints in the dataset annotated by the ground truth or predicted ratings
- subset(folds, subset_name='subset')
Creates a subset of the dataset based on the folds given as input
- subset(folds, subset_name='subset')
Obtains a subset of a stanscofi.Dataset based on a set of user and item indices
…
Parameters
- foldsCOO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
- subset_namestr
name of the newly created stanscofi.Dataset
Returns
- subsetstanscofi.Dataset
dataset corresponding to the folds in input
- summary(sep='----------------------------------------------------------------------')
Prints out a summary of the contents of a stanscofi.Dataset: the number of items, users, the number of positive, negative, unlabeled, unavailable matchings, the sparsity number, and the shape and percentage of missing values in the item and user feature matrices
…
Parameters
- sepstr
separator for pretty printing
…
Returns
- ndrugsint
number of drugs
- ndiseasesint
number of diseases
- ndrugs_knownint
number of drugs with at least one known (positive or negative) rating
- ndiseases_knownint
number of diseases with at least one known (positive or negative) rating
- npositiveint
number of positive ratings
- nnegativeint
number of negative ratings
- nunlabeled_unavailableint
number of unlabeled or unavailable ratings
- nunavailableint
number of unavailable ratings
- sparsityfloat
percentage of known ratings
- sparsity_knownfloat
percentage of known ratings among drugs and diseases with at least one known rating
- ndrug_featuresint
number of drug features
- missing_drug_featuresfloat
percentage of missing drug feature values
- ndisease_featuresint
number of disease features
- missing_disease_featuresfloat
percentage of missing disease feature values
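As an illustration of how the rating counts relate to ratings and folds, here is a minimal sketch on dense nested lists. It assumes that sparsity is the percentage of known (positive or negative) ratings over all (item, user) pairs; rating_counts is a hypothetical helper, and the real method also reports feature statistics that are not modelled here.

```python
def rating_counts(ratings, folds):
    """Count rating categories from dense ratings ({-1, 0, 1}) and folds ({0, 1})."""
    pairs = [(r, f) for r_row, f_row in zip(ratings, folds)
             for r, f in zip(r_row, f_row)]
    npositive = sum(1 for r, _ in pairs if r == 1)
    nnegative = sum(1 for r, _ in pairs if r == -1)
    return {
        "npositive": npositive,
        "nnegative": nnegative,
        # A 0 in ratings is either unlabelled or unavailable.
        "nunlabeled_unavailable": sum(1 for r, _ in pairs if r == 0),
        # A 0 in folds marks an unavailable pair.
        "nunavailable": sum(1 for _, f in pairs if f == 0),
        # Assumption: percentage of known ratings over all pairs.
        "sparsity": 100.0 * (npositive + nnegative) / len(pairs),
    }

ratings = [[1, 0, 0], [-1, 0, 1]]
folds = [[1, 1, 0], [1, 0, 1]]
counts = rating_counts(ratings, folds)
print(counts["npositive"], counts["nnegative"], counts["sparsity"])  # 2 1 50.0
```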
- visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)
Plots a representation of the datapoints in a stanscofi.Dataset which is annotated either by the ground truth labels or the predicted labels. The representation is the plot of the datapoints according to the first two Principal Components, or the first two dimensions in UMAP, if the feature matrices can be converted into a (n_ratings, n_features) shaped matrix where n_features>1, else it plots a heatmap with the values in the matrix for each rating pair.
In the legend, ground truth labels are denoted with brackets: e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by “pos” (positive) and “neg” (negative); correct (resp., incorrect) predictions are denoted by “correct”, resp. “error”
…
Parameters
- withzerosbool
boolean to assess whether (user, item) unknown matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted
- Xarray-like of shape (n_ratings, n_features) or None
(item, user) pair feature matrix
- yarray-like of shape (n_ratings, ) or None
response vector for each (item, user) pair in X; necessarily X should not be None if y is not None, and vice versa; setting X and y automatically overrides the other behaviors of this function
- metricstr
metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
- figsizetuple of size 2
width and height of the figure
- fontsizeint
size of the legend, title and labels of the figure
- dimred_argsdict
dictionary which lists the parameters to the dimensionality reduction method (either PCA, by default, or UMAP, if parameter “n_neighbors” is provided)
- predictionsarray-like of shape (n_ratings, 3) or None
a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints, otherwise, the predicted ratings will be used
- use_ratingsbool
if set to True, use the ratings in the dataset as predictions (for debugging purposes)
- random_stateint
random seed
- show_errorsbool
boolean to assess whether to color according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to the identity between the ground truth and the predicted labels for each (user, item) pair
- verbosebool
prints out information at each step
- stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)
Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.
Each of the nfeatures features of the (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those for negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated, which are the concatenation of npositive positive and nnegative negative pair feature vectors generated from Gaussian distributions. Thus there are npositive^2 positive ratings (each “positive” user with a “positive” item), nnegative^2 negative ratings (likewise), and the remainder is unknown (that is, (npositive+nnegative)^2-npositive^2-nnegative^2 ratings).
…
Parameters
- npositiveint
number of positive items/users
- nnegativeint
number of negative items/users
- nfeaturesint
number of item/user features
- meanfloat
mean of generating Gaussian distributions
- stdfloat
standard deviation of generating Gaussian distributions
Returns
- ratings : array-like of shape (n_items, n_users)
a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings
- users : array-like of shape (n_user_features, n_users)
the user feature matrix
- items : array-like of shape (n_item_features, n_items)
the item feature matrix
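The construction described above can be sketched without any dependency. This is an illustration under stated assumptions, not the library's implementation: the "positive" items and users are placed first, Python's random module replaces the actual random generator, and dummy_dataset is a hypothetical name.

```python
import random

def dummy_dataset(npositive, nnegative, nfeatures, mean, std, seed=0):
    """Return (ratings, users, items) as nested lists."""
    rng = random.Random(seed)
    n = npositive + nnegative
    half = nfeatures // 2

    def feature_matrix():
        # Shape (half, n): the first npositive columns are drawn from
        # N(mean, std), the remaining nnegative columns from N(-mean, std).
        return [[rng.gauss(mean if j < npositive else -mean, std)
                 for j in range(n)] for _ in range(half)]

    items, users = feature_matrix(), feature_matrix()
    # Positive x positive pairs are rated 1, negative x negative pairs -1,
    # and mixed pairs stay unknown (0).
    ratings = [[1 if (i < npositive and u < npositive)
                else -1 if (i >= npositive and u >= npositive)
                else 0
                for u in range(n)] for i in range(n)]
    return ratings, users, items

ratings, users, items = dummy_dataset(3, 2, 4, mean=1.0, std=0.1)
flat = [r for row in ratings for r in row]
print(flat.count(1), flat.count(-1), flat.count(0))  # 9 4 12
```

The printed counts match npositive^2, nnegative^2 and (npositive+nnegative)^2-npositive^2-nnegative^2 for npositive=3 and nnegative=2.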
stanscofi.models module
- class stanscofi.models.BasicModel(params)
Bases: object
A class used to encode a drug repurposing model
…
Parameters
- paramsdict
dictionary which contains method-wise parameters
Attributes
- namestr
the name of the model
- modeldepends on the implemented method
may contain an instance of a class of sklearn classifiers
- …
other attributes might be present depending on the type of model
Methods
- __init__(params)
Initializes the model with preselected parameters
- fit(train_dataset, seed=1234)
Preprocesses and fits the model
- predict_proba(test_dataset)
Outputs properly formatted predictions of the fitted model on test_dataset
- predict(scores, threshold=0.5)
Applies the following decision rule: if score<threshold, then return the negative label, otherwise return the positive label
- recommend_k_pairs(dataset, k=1, threshold=None)
Outputs the top-k (item, user) candidates (or the candidates whose score is higher than a threshold) in the input dataset
- print_scores(scores)
Prints out information about scores
- print_classification(predictions)
Prints out information about predicted labels
- preprocessing(train_dataset) [not implemented in BasicModel]
Preprocesses the input dataset into an input to self.model_fit, if it exists
- model_fit(train_dataset) [not implemented in BasicModel]
Fits the model on train_dataset
- model_predict_proba(test_dataset) [not implemented in BasicModel]
Outputs predictions of the fitted model on test_dataset
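The division of labour between fit/predict_proba and the three hooks (preprocessing, model_fit, model_predict_proba) can be sketched as a template-method pattern. TemplateModel below is a stand-in written for this example and is not stanscofi.models.BasicModel; the way the hooks are chained is an assumption based on the method descriptions above, and a "dataset" is reduced to a nested list of ratings.

```python
class TemplateModel:
    """Stand-in for the documented workflow (NOT stanscofi.models.BasicModel)."""

    def __init__(self, params):
        self.name = "TemplateModel"
        self.params = params
        self.model = None

    def fit(self, train_dataset, seed=1234):
        # Assumed chaining: preprocess, then fit on the preprocessed inputs.
        # The seed is unused in this toy example.
        inputs = self.preprocessing(train_dataset, is_training=True)
        self.model_fit(*inputs)

    def predict_proba(self, test_dataset):
        inputs = self.preprocessing(test_dataset, is_training=False)
        return self.model_predict_proba(*inputs)

    # Hooks left to subclasses.
    def preprocessing(self, dataset, is_training=True):
        raise NotImplementedError

    def model_fit(self, *args):
        raise NotImplementedError

    def model_predict_proba(self, *args):
        raise NotImplementedError


class MeanRatingModel(TemplateModel):
    """Toy subclass: scores every pair with the mean training rating."""

    def preprocessing(self, dataset, is_training=True):
        return [dataset]

    def model_fit(self, ratings):
        values = [v for row in ratings for v in row]
        self.model = sum(values) / len(values)

    def model_predict_proba(self, ratings):
        return [[self.model for _ in row] for row in ratings]


model = MeanRatingModel({})
model.fit([[1, -1], [1, 1]])
scores = model.predict_proba([[0, 0]])
print(scores)  # [[0.5, 0.5]]
```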
- fit(train_dataset, seed=1234)
Fits the model on the training dataset.
Relies on preprocessing and model_fit, which are not implemented in the BasicModel class.
…
Parameters
- train_datasetstanscofi.Dataset
training dataset on which the model should fit
- seedint (default: 1234)
random seed
- model_fit()
Fitting the model on the training dataset.
<Not implemented in the BasicModel class.>
…
Parameters
- ……
appropriate inputs to the classifier (vary across algorithms)
- model_predict_proba()
Making predictions using the model on the testing dataset.
<Not implemented in the BasicModel class.>
…
Parameters
- ……
appropriate inputs to the classifier (vary across algorithms)
…
Returns
- scores : array-like of shape (n_items, n_users)
prediction values by the model
- predict(scores, threshold=0.5)
Outputs class labels based on the scores, using the following formula:
prediction = -1 if (score<threshold) else 1
…
Parameters
- scoresCOO-array of shape (n_items, n_users)
sparse matrix in COOrdinate format
- thresholdfloat
the threshold of classification into the positive class
Returns
- predictionsCOO-array of shape (n_items, n_users)
sparse matrix in COOrdinate format with values in {-1,1}
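The decision rule can be written out on a dense nested list as a minimal sketch. The handling of the sparse COO format (for instance, what happens to pairs that are absent from the score matrix) is not modelled here, and apply_threshold is a hypothetical name.

```python
def apply_threshold(scores, threshold=0.5):
    """Return -1 where score < threshold, and 1 otherwise."""
    return [[-1 if s < threshold else 1 for s in row] for row in scores]

labels = apply_threshold([[0.2, 0.5], [0.9, 0.49]])
print(labels)  # [[-1, 1], [1, -1]]
```

Note that a score exactly equal to the threshold is assigned to the positive class.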
- predict_proba(test_dataset, default_zero_val=1e-31)
Outputs properly formatted scores (not necessarily in [0,1]!) from the fitted model on test_dataset. Internally calls model_predict_proba(), then reformats the scores
…
Parameters
- test_datasetstanscofi.Dataset
dataset on which predictions should be made
Returns
- scoresCOO-array of shape (n_items, n_users)
sparse matrix in COOrdinate format, with nonzero values corresponding to predictions on available pairs in the dataset
- preprocessing(dataset, is_training=True)
Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the classifier (e.g., X feature matrix for each (user, item) pair, y response vector).
<Not implemented in the BasicModel class.>
…
Parameters
- datasetstanscofi.Dataset
dataset to convert
- is_trainingbool
is the preprocessing prior to training (true) or testing (false)?
Returns
- ……
appropriate inputs to the classifier (vary across algorithms)
- print_classification(predictions)
Prints out information about the predicted classes
…
Parameters
- predictionsCOO-array
sparse matrix in COOrdinate format
- print_scores(scores)
Prints out information about the scores
…
Parameters
- scoresCOO-array
sparse matrix in COOrdinate format
- recommend_k_pairs(dataset, k=1, threshold=None)
Outputs the top-k (item, user) candidates (or the candidates whose score is higher than a threshold) in the input dataset
…
Parameters
- datasetstanscofi.Dataset
dataset on which predictions should be made
- kint or None (default: 1)
number of pair candidates to return (with ties)
- threshold : float or None (default: None)
threshold on candidate scores. If k is not None, k best candidates are returned independently of the value of threshold
…
Returns
- candidateslist of tuples of size 3
list of (item, user, score) candidates (by name as present in the dataset)
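The selection logic (top-k with ties, or score threshold when k is None) can be sketched on a dense score matrix. This is an illustration, not the library's code: the comparison with the threshold is assumed to be non-strict, and top_k_pairs is a hypothetical helper that takes the scores and name lists directly instead of a dataset.

```python
def top_k_pairs(scores, item_list, user_list, k=1, threshold=None):
    """Return (item, user, score) triples sorted by decreasing score."""
    ranked = [(item_list[i], user_list[u], s)
              for i, row in enumerate(scores) for u, s in enumerate(row)]
    ranked.sort(key=lambda t: t[2], reverse=True)
    if k is not None:
        if k <= 0 or not ranked:
            return []
        # Keep every pair tied with the k-th best score;
        # the threshold is ignored in this branch.
        cutoff = ranked[min(k, len(ranked)) - 1][2]
        return [t for t in ranked if t[2] >= cutoff]
    if threshold is not None:
        # Assumption: non-strict comparison with the threshold.
        return [t for t in ranked if t[2] >= threshold]
    return ranked

scores = [[0.9, 0.1], [0.9, 0.4]]
best = top_k_pairs(scores, ["d1", "d2"], ["u1", "u2"], k=1)
print(best)  # [('d1', 'u1', 0.9), ('d2', 'u1', 0.9)]
```

Because of ties, asking for k=1 can return more than one candidate.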
- class stanscofi.models.LogisticRegression(params)
Bases: BasicModel
Logistic Regression (calls sklearn.linear_model.LogisticRegression internally). It uses the very same parameters as sklearn.linear_model.LogisticRegression, so please refer to help(sklearn.linear_model.LogisticRegression).
…
Parameters
- paramsdict
dictionary which contains sklearn.linear_model.LogisticRegression parameters, plus a key called “preprocessing” which determines which preprocessing function (in stanscofi.preprocessing) should be applied to data, plus a key called “subset” which gives the maximum number of features to consider in the model (those features will be the Top-subset in terms of variance across samples)
Attributes
Same as BasicModel class
Methods
Same as BasicModel class
- preprocessing(train_dataset)
Preprocesses the input dataset into something that is an input to fit
- model_fit(train_dataset)
Preprocesses and fits the model
- model_predict_proba(test_dataset)
Outputs predictions of the fitted model on test_dataset
- model_fit(X, y)
Fitting the Logistic Regression model on the training dataset.
…
Parameters
- Xarray-like of shape (n_ratings, n_pair_features)
(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)
- yarray-like of shape (n_ratings, )
response vector for each (user, item) pair
- model_predict_proba(X)
Making predictions using the Logistic Regression model on the testing dataset.
…
Parameters
- Xarray-like of shape (n_ratings, n_pair_features)
(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)
- preprocessing(dataset, is_training=True)
Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the Logistic Regression classifier.
…
Parameters
- datasetstanscofi.Dataset
dataset to convert
- is_trainingbool
is the preprocessing prior to training (true) or testing (false)?
Returns
args, which contains
- X : array-like of shape (n_ratings, n_pair_features)
(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)
- y : array-like of shape (n_ratings, )
response vector for each (user, item) pair
- class stanscofi.models.NMF(params)
Bases: BasicModel
Non-negative Matrix Factorization (calls sklearn.decomposition.NMF internally). It uses the very same parameters as sklearn.decomposition.NMF, so please refer to help(sklearn.decomposition.NMF).
…
Parameters
- paramsdict
dictionary which contains sklearn.decomposition.NMF parameters
Attributes
Same as BasicModel class
Methods
Same as BasicModel class
- preprocessing(train_dataset)
Preprocesses the input dataset into something that is an input to fit
- model_fit(train_dataset)
Preprocesses and fits the model
- model_predict_proba(test_dataset)
Outputs predictions of the fitted model on test_dataset
- model_fit(input)
Fitting the NMF model on the preprocessed training dataset.
…
Parameters
- inputarray-like of shape (n_samples,n_features)
training data
- model_predict_proba(input)
Making predictions using the NMF model on the testing dataset.
…
Parameters
- inputarray-like of shape (n_samples,n_features)
testing data
…
Returns
result : array-like of shape (n_samples,n_features)
- preprocessing(dataset, is_training=True)
Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the NMF classifier.
…
Parameters
- datasetstanscofi.Dataset
dataset to convert
- is_trainingbool
is the preprocessing prior to training (true) or testing (false)?
Returns
args, which contains
- A : array-like of shape (n_users, n_items)
the transposed association matrix, translated so that all its values are non-negative
stanscofi.preprocessing module
- class stanscofi.preprocessing.CustomScaler(posinf, neginf)
Bases: object
A class used to encode a simple preprocessing pipeline for feature matrices. Does mean imputation for features, feature filtering, correction of infinity errors and standardization
…
Parameters
- posinfint
Value to replace infinity (positive) values
- neginfint
Value to replace infinity (negative) values
Attributes
- imputerNone or sklearn.impute.SimpleImputer instance
Class for imputation of values
- scalerNone or sklearn.preprocessing.StandardScaler
Class for standardization of values
- filterNone or list
List of selected features (Top-N in terms of variance)
Methods
- __init__(posinf, neginf)
Initialize the scaler (with unfitted attributes)
- fit_transform(mat, subset=None, verbose=False)
Fits classes and transforms a matrix
- fit_transform(mat, subset=None, verbose=False)
Fits each attribute of the scaler and transforms a feature matrix. Does mean imputation for features, feature filtering, correction of infinity errors and standardization
…
Parameters
- matarray-like of shape (n_samples, n_features)
matrix which should be preprocessed
- subsetNone or int
number of features to keep in feature matrix (Top-N in variance); if it is None, attribute filter is either initialized (if it is equal to None) or used to filter features
- verbosebool
prints out information
Returns
- mat_nanarray-like of shape (n_samples, n_features)
Preprocessed matrix
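The four steps named above (correction of infinities, mean imputation, Top-N variance filtering, standardization) can be sketched with the standard library only. The order of the steps, the default placeholder values and the handling of constant features are assumptions made for this example; scale_features is a hypothetical function and keeps no fitted state, unlike the class.

```python
import math
import statistics

def scale_features(mat, subset=None, posinf=10.0, neginf=-10.0):
    """mat: nested list of shape (n_samples, n_features); NaN marks missing values."""
    # 1. Correct infinities with the placeholder values.
    rows = [[posinf if v == math.inf else neginf if v == -math.inf else v
             for v in row] for row in mat]
    cols = [list(col) for col in zip(*rows)]
    # 2. Mean imputation, feature by feature.
    imputed = []
    for col in cols:
        known = [v for v in col if not math.isnan(v)]
        mean = statistics.fmean(known) if known else 0.0
        imputed.append([mean if math.isnan(v) else v for v in col])
    # 3. Keep the Top-N features in terms of variance (original column order).
    if subset is not None:
        ranked = sorted(range(len(imputed)),
                        key=lambda j: statistics.pvariance(imputed[j]),
                        reverse=True)
        imputed = [imputed[j] for j in sorted(ranked[:subset])]
    # 4. Standardize each remaining feature (constant features map to 0).
    scaled = []
    for col in imputed:
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]

nan = float("nan")
out = scale_features([[1.0, 7.0], [nan, 7.0], [3.0, 7.0]], subset=1)
print(len(out), len(out[0]))  # 3 1
```

In the example, the second feature is constant, so only the first one survives the variance filter; its missing value is imputed with the mean and therefore maps to 0 after standardization.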
- stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)
Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Instead of concatenating item and user features for a given pair, resulting in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices), compute a single score per pair of (item_matrix, user_matrix) for each (item, user) pair, resulting in a vector of size (n_item_matrices) x (n_user_matrices).
The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i') x fu(u', u)) | (i',u')!=(i,u), fi(i, i')!=NaN, fu(u', u)!=NaN, rating(i',u')!=0 }
Then the final feature matrix is X = (X_{i,j})_{i,j} for i a (item, user) pair and j a (item similarity, user similarity) pair
…
Parameters
- datasetstanscofi.Dataset
dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features), with the following attributes:
- ratingsCOO-array of shape (n_items, n_users)
an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
- itemsCOO-array of shape (n_item_features, n_items)
concatenation of n_drug_features drug similarity matrices of shape (n_drugs, n_drugs), where values in item_features are denoted by “<feature><sep_feature><drug>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that items is a single similarity matrix (n_item_matrices=1)
- usersCOO-array of shape (n_user_features, n_users)
concatenation of n_disease_features disease similarity matrices of shape (n_diseases, n_diseases), where values in user_features are denoted by “<feature><sep_feature><disease>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that users is a single similarity matrix (n_user_matrices=1)
NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).
- njobsint
number of jobs to run in parallel
- sep_featurestr or None
separator between feature type and element in the feature matrices in dataset. None if there is one single feature type expected
- missingint
placeholder value that should be different from any feature name
- infint
Value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)
- verbosebool
prints out information
Returns
- Xarray-like of shape (n_items x n_users, n_item_features x n_user_features)
the feature matrix
- yarray-like of shape (n_items x n_users, )
the response/outcome vector
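For a single item-item similarity matrix and a single user-user similarity matrix, the score defined above can be computed as in the following sketch. It assumes non-negative similarities, symmetric matrices stored as nested lists, and a score of 0 when no other known pair is usable; perlman_score is a hypothetical helper, not the library function.

```python
import math

def perlman_score(i, u, item_sim, user_sim, ratings):
    """Best sqrt(fi(i, i') x fu(u', u)) over known pairs (i', u') != (i, u)."""
    best = 0.0  # assumption: default when no other known pair qualifies
    for i2, row in enumerate(ratings):
        for u2, r in enumerate(row):
            if (i2, u2) == (i, u) or r == 0:
                continue  # skip the pair itself and unknown ratings
            fi, fu = item_sim[i][i2], user_sim[u2][u]
            if math.isnan(fi) or math.isnan(fu):
                continue  # skip missing similarities
            best = max(best, math.sqrt(fi * fu))
    return best

item_sim = [[1.0, 0.5], [0.5, 1.0]]
user_sim = [[1.0, 0.25], [0.25, 1.0]]
ratings = [[1, 0], [0, -1]]
score = perlman_score(0, 1, item_sim, user_sim, ratings)
print(round(score, 4))  # 0.7071
```

For pair (0, 1), the known pair (0, 0) contributes sqrt(1.0 x 0.25) = 0.5 and the known pair (1, 1) contributes sqrt(0.5 x 1.0), which is the maximum.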
- stanscofi.preprocessing.cartesian_product_transpose(*arrays)
- stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)
Computes a single feature matrix and response vector from a drug repurposing dataset, by imputation by the average value of a feature for missing values and by centering and standardizing user and item feature matrices and concatenating them
…
Parameters
- datasetstanscofi.Dataset
dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) where missing values are denoted by numpy.nan
- subsetNone or int
number of features to keep in item feature matrix, and in user feature matrix (selecting the ones with highest variance)
- scalerSNone or sklearn.preprocessing.StandardScaler instance
scaler for items
- scalerPNone or sklearn.preprocessing.StandardScaler instance
scaler for users
- verbosebool
prints out information
Returns
- Xarray-like of shape (n_folds, n_item_features+n_user_features)
the feature matrix
- yarray-like of shape (n_folds, )
the response/outcome vector
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
- stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)
Converts a dataset into a feature matrix X and a response vector y by applying the preprocessing approach named in preprocessing_str
…
Parameters
- datasetstanscofi.datasets.Dataset
dataset to preprocess
- preprocessing_strstr
type of preprocessing, one of ["Perlman_procedure", "meanimputation_standardize", "same_feature_preprocessing"]
- subset_None or int
Number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance). Warning: the selection is made across user and item features together if preprocessing_str!="meanimputation_standardize"; otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)
- operatorNone or str
arithmetic operation to apply, e.g. "+", "*"
- sep_featurestr
separator between feature type and element in the feature matrices in dataset
- filter_None or list
list of feature indices to keep (of length subset_) (overrides the subset_ parameter if both are fed)
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; the scaler fitted on user feature vectors
- inffloat or int
placeholder value for infinity values (positive : +inf, negative : -inf)
- njobsint
number of jobs to run in parallel (njobs > 0) for the Perlman procedure
Returns
- Xarray-like of shape (n_folds, n_features)
the feature matrix
- yarray-like of shape (n_folds, )
the response/outcome vector
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
- filter_None or list
list of feature indices to keep (of length subset_)
- stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)
If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors coefficient per coefficient.
…
Parameters
- datasetstanscofi.Dataset
dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True
- operatorstr
arithmetic operation to apply, e.g. "+", "*"
Returns
- Xarray-like of shape (n_folds, n_features)
the feature matrix
- yarray-like of shape (n_folds, )
the response/outcome vector
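The element-wise combination can be sketched as follows. Feature matrices are nested lists of shape (n_features, n_items) and (n_features, n_users), as in the Dataset attributes. The sketch builds one row per (item, user) pair, including unknown ones, which is a simplification; combine_pair_features and the OPERATORS table are hypothetical names.

```python
import operator

OPERATORS = {"+": operator.add, "*": operator.mul}

def combine_pair_features(items, users, ratings, op="*"):
    """Build X (one row per pair) and y by combining features coefficient-wise."""
    combine = OPERATORS[op]
    X, y = [], []
    for i, row in enumerate(ratings):
        for u, rating in enumerate(row):
            # k-th coefficient of the pair = op(k-th item feature, k-th user feature)
            X.append([combine(items[k][i], users[k][u]) for k in range(len(items))])
            y.append(rating)
    return X, y

items = [[1, 2], [3, 4]]  # 2 features x 2 items
users = [[10], [20]]      # 2 features x 1 user
ratings = [[1], [-1]]     # 2 items x 1 user
X, y = combine_pair_features(items, users, ratings, op="*")
print(X, y)  # [[10, 60], [20, 80]] [1, -1]
```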
stanscofi.training_testing module
- stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)
Trains a model on a dataset using cross-validation (through sklearn.model_selection.StratifiedKFold) and custom metrics
…
Parameters
- templatestanscofi.BasicModel or subclass
type of model to train
- paramsdict
dictionary of parameters to initialize the model
- train_datasetstanscofi.Dataset
dataset to train upon
- nsplitsint
number of cross-validation steps
- metricstr
metric to optimize the model upon. Implemented metrics are in validation.py
- kint (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- betafloat (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- thresholdfloat (default: 0)
decision threshold
- test_sizefloat (default: 0.2)
percentage of testing set (if cv_type=”weakly_correlated”)
- dist_typestr (default: “cosine”)
type of metric for splitting (if cv_type=”weakly_correlated”)
- cv_typestr (default: “random”)
type of split to apply to the dataset. Can either be “random” or “weakly_correlated”
- early_stopint or None
positive integer, which stops the cluster number search after early_stop tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore
- njobsint (default: 1)
number of jobs to run in parallel. Should be lower than nsplits-1
- random_stateint (default: 1234)
random seed
- show_plotsbool (default: False)
shows the validation plots at each cross-validation step
- verbosebool (default: False)
prints out information
Returns
- resultsdict
- a dictionary which contains
- “models”list of subinstances of stanscofi.models.BasicModel of length nsplits
all trained models
- “train_metric”list of floats of length nsplits
all metrics on training sets
- “test_metric”list of floats of length nsplits
all metrics on testing sets
- “cv_folds”list of COO-array of shape (n_items, n_users) of length nsplits
the training and testing folds for each split
- stanscofi.training_testing.grid_search(search_params, template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)
Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.
…
Parameters
- search_paramsdict
a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search
- templatestanscofi.BasicModel or subclass
type of model to train
- paramsdict
dictionary of parameters to initialize the model
- train_datasetstanscofi.Dataset
dataset to train upon
- metricstr
metric to optimize the model upon. Implemented metrics are in validation.py
- kint (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- betafloat (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- thresholdfloat (default: 0)
decision threshold
- test_sizefloat (default: 0.2)
percentage of testing set (if cv_type=”weakly_correlated”)
- dist_typestr (default: “cosine”)
type of metric for splitting (if cv_type=”weakly_correlated”)
- cv_typestr (default: “random”)
type of split to apply to the dataset. Can either be “random” or “weakly_correlated”
- njobsint (default: 1)
number of jobs to run in parallel. Should be lower than nsplits-1
- random_stateint (default: 1234)
random seed
- show_plotsbool (default: False)
shows the validation plots at each cross-validation step
- verbosebool (default: False)
prints out information
Returns
- best_paramsdict
a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps
- best_modelsubinstance of stanscofi.models.BasicModel
the best trained model associated with the best parameters
- metricsdict
- a dictionary which contains
- “train_metric”float
the metric on the training set on the best cross-validation split for the best set of parameters
- “test_metric”float
the metric on the testing set on the best cross-validation split for the best set of parameters
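Optimizing over one hyperparameter at a time can be sketched as a coordinate-wise search. In this illustration, the evaluate callable stands in for a call to cv_training that returns a single score to maximize; both that callable and the function name are hypothetical, and the tie-breaking and update rules of the library may differ.

```python
def one_at_a_time_search(search_params, params, evaluate):
    """Optimize each hyperparameter in turn, keeping the others fixed."""
    best_params = dict(params)
    best_score = evaluate(best_params)
    for name, candidates in search_params.items():
        for value in candidates:
            trial = dict(best_params)
            trial[name] = value
            score = evaluate(trial)
            if score > best_score:  # assumption: higher is better
                best_params, best_score = trial, score
    return best_params, best_score

def toy_metric(p):
    # Toy objective with its maximum at C=1.0, tol=0.1.
    return -((p["C"] - 1.0) ** 2) - ((p["tol"] - 0.1) ** 2)

best_params, best_score = one_at_a_time_search(
    {"C": [0.1, 1.0, 10.0], "tol": [0.01, 0.1]},
    {"C": 0.1, "tol": 0.01},
    toy_metric,
)
print(best_params)  # {'C': 1.0, 'tol': 0.1}
```

Unlike an exhaustive grid, this explores the sum rather than the product of the interval sizes, at the cost of possibly missing interactions between hyperparameters.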
- stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)
Converts indices of datapoints into folds as defined in stanscofi
…
Parameters
- indicesarray-like of size (n_selected_ratings, )
flat indices of selected datapoints
- indices_arrayarray-like of size (n_total_ratings, 2)
corresponding row and column indices of datapoints
- shapetuple of integers of size 2
total numbers of rows and columns
Returns
- foldsCOO-array of shape shape
folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
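A dense sketch of the conversion is given below. It assumes that each entry of indices points to a row of indices_array, which in turn stores the (row, column) position of a datapoint; a nested list stands in for the COO array, and folds_from_indices is a hypothetical name.

```python
def folds_from_indices(indices, indices_array, shape):
    """Return a dense {0, 1} matrix with a 1 at each selected (row, column)."""
    nrows, ncols = shape
    folds = [[0] * ncols for _ in range(nrows)]
    for idx in indices:
        row, col = indices_array[idx]
        folds[row][col] = 1
    return folds

indices_array = [(0, 0), (0, 2), (1, 1)]  # positions of all datapoints
folds = folds_from_indices([0, 2], indices_array, shape=(2, 3))
print(folds)  # [[1, 0, 0], [0, 1, 0]]
```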
- stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')
Splits the data into training and testing datasets randomly for cross-validation.
…
Parameters
- datasetstanscofi.Dataset
dataset to split
- cv_generatorscikit-learn cross-validation index generator
e.g. StratifiedKFold, KFold
- metricstr
metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
Returns
- cv_foldslist of size nsplits of COO-array of shape (n_items, n_users)
list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_lstlist of size nsplits of tuples of float of size 3
for each fold, minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
- stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)
Splits the data into training and testing datasets randomly.
…
Parameters
- datasetstanscofi.Dataset
dataset to split
- test_sizefloat
value strictly between 0 and 1 which indicates the maximum proportion of initial data (positive and negative ratings) being assigned to the test dataset
- metricstr
metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
- random_stateint
random seed
Returns
- cv_foldslist of COO-array of shape (n_items, n_users)
list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_train_test, dist_train, dist_testfloat
minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
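A random split of the known ratings can be sketched with NumPy alone. This is not the stanscofi implementation: the helper name random_simple_split_sketch is hypothetical, known ratings are assumed to be the nonzero entries, and the test set size is assumed to be rounded down so that it never exceeds test_size.

```python
import numpy as np
from scipy.sparse import coo_array

def random_simple_split_sketch(ratings, test_size, random_state=1234):
    """Hypothetical helper: random train/test folds over nonzero ratings."""
    ratings = np.asarray(ratings)
    rows, cols = np.nonzero(ratings)  # positive and negative ratings only
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(rows))
    n_test = int(np.floor(test_size * len(rows)))  # at most test_size of the data
    test_idx, train_idx = order[:n_test], order[n_test:]

    def to_folds(idx):
        data = np.ones(len(idx), dtype=int)
        return coo_array((data, (rows[idx], cols[idx])), shape=ratings.shape)

    return to_folds(train_idx), to_folds(test_idx)

ratings = np.array([[1, 0, -1], [0, 1, 0], [-1, 0, 1], [1, -1, 0]])
train_folds, test_folds = random_simple_split_sketch(ratings, test_size=0.25)
```

With 7 known ratings and test_size=0.25, one rating goes to the test folds and six to the training folds, and the two sets do not overlap.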
Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.
…
Parameters
- datasetstanscofi.Dataset
dataset to split
- test_sizefloat
value strictly between 0 and 1 which indicates the maximum proportion of initial data (positive and negative ratings) being assigned to the test dataset
- early_stopint or None
positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore
- metricstr
metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
- random_stateint
random seed
- niterint
maximum number of iterations of the clustering loop
- verbosebool
prints out information
Returns
- train_folds, test_foldsCOO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_train_test, dist_train, dist_testfloat
minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
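The clustering step mentioned above can be sketched with scipy.cluster.hierarchy. This is only an illustration under assumptions: the helper name cluster_items_sketch is hypothetical, the "average" linkage method and the fixed number of clusters are arbitrary choices, and the search over the number of clusters is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_items_sketch(item_features, n_clusters, metric="cosine"):
    """Hypothetical helper: cluster items from a (n_item_features, n_items) matrix."""
    # NaNs are converted to 0, then items are put in rows
    X = np.nan_to_num(np.asarray(item_features, dtype=float)).T
    Z = linkage(X, method="average", metric=metric)  # linkage method is an assumption
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Two features, four items; the last item has a missing feature
item_features = np.array([[1.0, 0.9, 0.0, np.nan],
                          [0.0, 0.1, 1.0, 1.0]])
labels = cluster_items_sketch(item_features, n_clusters=2)
```

Items sharing a cluster label would then be kept on the same side of the split, so that training and testing items stay dissimilar.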
stanscofi.utils module
- stanscofi.utils.compute_sparsity(df)
Computes the sparsity number of a collaborative filtering dataset
…
Parameters
- dfpandas.DataFrame of shape (n_items, n_users)
the matrix of ratings where unknown matchings are denoted with 0
Returns
- sparsityfloat
the percentage of non-missing values in the matrix of ratings
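As a rough illustration, the quantity described above can be computed with pandas. The helper name sparsity_sketch is hypothetical, and expressing the result on a 0 to 100 scale is an assumption; the library may use a different scale.

```python
import pandas as pd

def sparsity_sketch(df):
    """Hypothetical helper: share of nonzero (known) ratings, in percent."""
    known = (df.values != 0).sum()
    return 100.0 * known / df.size

df = pd.DataFrame([[1, 0, -1], [0, 0, 1]],
                  index=["drugA", "drugB"],
                  columns=["dis1", "dis2", "dis3"])
sparsity = sparsity_sketch(df)
```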
- stanscofi.utils.load_dataset(model_name, save_folder='./', sep_feature='-')
Loads a drug repurposing dataset
…
Parameters
- model_namestr
the name of the dataset to load. Should belong to the following list: [“Gottlieb”, “DNdataset”, “Cdataset”, “LRSSL”, “PREDICT_Gottlieb”, “TRANSCRIPT”, “PREDICT”]
- save_folderstr
the path to the folder where dataset-related files are or will be stored
Returns
- dataset_didictionary
a dictionary where key “ratings” contains the drug-disease matching pandas.DataFrame of shape (n_drugs, n_diseases) (where missing values are denoted by 0), key “users” corresponds to the disease feature pandas.DataFrame of shape (n_disease_features, n_diseases), and key “items” corresponds to the drug feature pandas.DataFrame of shape (n_drug_features, n_drugs)
- stanscofi.utils.matrix2ratings(df, user_col='user', item_col='item', rating_col='rating')
Converts a matrix into a list of ratings
…
Parameters
- dfpandas.DataFrame of shape (n_items, n_users)
the matrix of ratings in {-1, 1, 0} where unknown matchings are denoted with 0
- user_colstr
column denoting users
- item_colstr
column denoting items
- rating_colstr
column denoting ratings in {-1, 0, 1}
Returns
- ratingspandas.DataFrame of shape (n_ratings, 3)
the list of known ratings where the first column corresponds to users, the second to items, and the third to ratings
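The matrix-to-list conversion can be sketched with pandas. The helper name matrix2ratings_sketch is hypothetical, and dropping the 0 entries (unknown matchings) from the output is an assumption based on the phrase "known ratings".

```python
import pandas as pd

def matrix2ratings_sketch(df, user_col="user", item_col="item", rating_col="rating"):
    """Hypothetical helper: (n_items, n_users) matrix -> long list of ratings."""
    long = df.stack().reset_index()           # one row per (item, user) pair
    long.columns = [item_col, user_col, rating_col]
    long = long[long[rating_col] != 0]        # assumption: unknown matchings are dropped
    return long[[user_col, item_col, rating_col]].reset_index(drop=True)

df = pd.DataFrame([[1, 0], [-1, 1]],
                  index=["drugA", "drugB"],
                  columns=["dis1", "dis2"])
ratings_list = matrix2ratings_sketch(df)
```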
- stanscofi.utils.merge_ratings(rating_dfs, user_col, item_col, rating_col)
Merges rating lists from several sources by solving conflicts. Conflicting ratings are resolved as follows: if there is at least one negative rating (-1) reported for a (drug, disease) pair, then the final rating is negative (-1); if there is at least one positive rating (1) and no negative rating (-1) reported, then the final rating is positive (1)
…
Parameters
- rating_dfslist of pandas.DataFrame of shape (n_ratings, 3)
the list of rating lists where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}
- user_colstr
column denoting users
- item_colstr
column denoting items
- rating_colstr
column denoting ratings in {-1, 0, 1}
- verbosebool
Returns
- rating_dfpandas.DataFrame of shape (n_ratings, 3)
the merged list of ratings where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}
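The conflict resolution rule stated above (any negative rating wins, otherwise any positive rating wins) can be sketched with a pandas groupby. The helper name merge_ratings_sketch is hypothetical, and returning 0 when only unknown ratings are reported is an assumption.

```python
import pandas as pd

def merge_ratings_sketch(rating_dfs, user_col, item_col, rating_col):
    """Hypothetical helper: merge rating lists, resolving conflicts per pair."""
    stacked = pd.concat(rating_dfs, ignore_index=True)

    def resolve(values):
        if (values == -1).any():   # at least one negative rating
            return -1
        if (values == 1).any():    # at least one positive and no negative rating
            return 1
        return 0                   # assumption: only unknown ratings reported

    grouped = stacked.groupby([user_col, item_col])[rating_col]
    return grouped.agg(resolve).reset_index()

a = pd.DataFrame({"user": ["d1", "d2"], "item": ["A", "A"], "rating": [1, 1]})
b = pd.DataFrame({"user": ["d1", "d3"], "item": ["A", "B"], "rating": [-1, 0]})
merged = merge_ratings_sketch([a, b], "user", "item", "rating")
```

Here the pair (d1, A) is reported as positive in one source and negative in the other, so the merged rating is -1.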
- stanscofi.utils.print_dataset(ratings, user_col, item_col, rating_col)
Prints values of a drug repurposing dataset
…
Parameters
- ratingspandas.DataFrame of shape (n_ratings, 3)
the list of ratings with columns user_col, item_col, rating_col
- user_colstr
column denoting users
- item_colstr
column denoting items
- rating_colstr
column denoting ratings in {-1, 0, 1}
Returns
None
Prints
The number of items/drugs, users/diseases, and the number of positive (1), negative (-1) and unknown (0) matchings.
- stanscofi.utils.ratings2matrix(ratings, user_col, item_col, rating_col)
Converts a list of ratings into a matrix
…
Parameters
- ratingspandas.DataFrame of shape (n_ratings, 3)
the list of known ratings where the first column (user_col) corresponds to users, the second (item_col) to items, and the third (rating_col) to ratings in {-1, 0, 1}
- user_colstr
column denoting users
- item_colstr
column denoting items
- rating_colstr
column denoting ratings in {-1, 0, 1}
Returns
- dfpandas.DataFrame of shape (n_items, n_users)
the matrix of ratings in {-1, 1, 0} where unknown matchings are denoted with 0
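The reverse conversion, from a list of ratings to a matrix, can be sketched with a pandas pivot. The helper name ratings2matrix_sketch is hypothetical, and the sketch assumes that each (item, user) pair appears at most once in the list.

```python
import pandas as pd

def ratings2matrix_sketch(ratings, user_col, item_col, rating_col):
    """Hypothetical helper: long list of ratings -> (n_items, n_users) matrix."""
    matrix = ratings.pivot(index=item_col, columns=user_col, values=rating_col)
    # Pairs absent from the list are unknown matchings, denoted with 0
    return matrix.fillna(0).astype(int)

ratings = pd.DataFrame({
    "user": ["dis1", "dis1", "dis2"],
    "item": ["drugA", "drugB", "drugB"],
    "rating": [1, -1, 1],
})
matrix = ratings2matrix_sketch(ratings, "user", "item", "rating")
```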
stanscofi.validation module
- stanscofi.validation.AP(y_true, y_pred, u, u1)
- stanscofi.validation.AUC(y_true, y_pred, k, u1)
- stanscofi.validation.DCGk(y_true, y_pred, k, u1)
- stanscofi.validation.ERR(y_true, y_pred, max=10, max_grade=2)
source: https://raw.githubusercontent.com/skondo/evaluation_measures/master/evaluations_measures.py
- stanscofi.validation.F1K(y_true, y_pred, k, u1)
- stanscofi.validation.Fscore(y_true, y_pred, u, beta)
- stanscofi.validation.HRk(y_true, y_pred, k, u1)
- stanscofi.validation.MAP(y_true, y_pred, u, u1)
- stanscofi.validation.MRR(y_true, y_pred, u, u1)
- stanscofi.validation.MeanRank(y_true, y_pred, k, u1)
- stanscofi.validation.NDCGk(y_true, y_pred, k, u1)
- stanscofi.validation.PrecisionK(y_true, y_pred, k, u1)
- stanscofi.validation.RP(y_true, y_pred, u, u1)
- stanscofi.validation.RecallK(y_true, y_pred, k, u1)
- stanscofi.validation.Rscore(y_true, y_pred, u, u1)
- stanscofi.validation.TAU(y_true, y_pred, u, u1)
- stanscofi.validation.compute_metrics(scores, predictions, dataset, metrics, k=1, beta=1, verbose=False)
Computes user-wise validation metrics for a given set of scores and predictions w.r.t. a dataset
…
Parameters
- scoresCOO-array of shape (n_items, n_users)
sparse matrix in COOrdinate format
- predictionsCOO-array of shape (n_items, n_users)
sparse matrix in COOrdinate format with values in {-1,1}
- datasetstanscofi.Dataset
dataset on which the metrics should be computed
- metricslist of str
list of metrics which should be computed
- kint (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- betafloat (default: 1)
Argument of the metric to optimize. Implemented metrics are in validation.py
- verbosebool
prints out information about users ignored in the computation of validation metrics, that is, users whose pairs all belong to a single class (i.e., all pairs involving the user are assigned 0, all are assigned -1, or all are assigned 1)
Returns
- metricspandas.DataFrame of shape (len(metrics), 2)
table of metrics: metrics in rows, average and standard deviation across users in columns
- plots_argsdict
dictionary of arguments to feed to the plot_metrics function to plot the Precision-Recall and the Receiver Operating Characteristic (ROC) curves
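The idea of a user-wise metric, averaged across users while ignoring users with a single class, can be sketched as follows. This is not the stanscofi implementation: the helper names, the column names "Average" and "StandardDeviation", the use of dense arrays, and the pairwise AUC formula are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def auc_pairwise(y_true, y_score):
    """Fraction of (positive, non-positive) pairs ranked correctly; ties count 1/2."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true != 1]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def userwise_auc_sketch(scores, truth):
    """Hypothetical helper: AUC per user (column), then mean and std across users."""
    aucs = []
    for u in range(truth.shape[1]):
        y = truth[:, u]
        # Users without both positive and non-positive pairs are ignored
        if not ((y == 1).any() and (y != 1).any()):
            continue
        aucs.append(auc_pairwise(y, scores[:, u]))
    return pd.DataFrame(
        {"Average": [np.mean(aucs)], "StandardDeviation": [np.std(aucs)]},
        index=["AUC"],
    )

# Rows are items, columns are users; the third user has only unknown ratings
truth = np.array([[1, 1, 0], [-1, 1, 0], [1, -1, 0]])
scores = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.5, 0.2]])
table = userwise_auc_sketch(scores, truth)
```

In this toy example the first user has an AUC of 1, the second an AUC of 0.5, and the third is skipped, so the table reports the mean and standard deviation over two users.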
- stanscofi.validation.plot_metrics(y_true=None, y_pred=None, scores=None, ground_truth=None, predictions=None, aucs=None, fscores=None, tprs=None, recs=None, figsize=(16, 5), model_name='Model')
Plots the ROC curve, the Precision-Recall curve, the boxplot of predicted scores and the pie chart of classes associated with the input predictions y_pred w.r.t. the ground truth y_true
…
Parameters
- y_truearray-like of shape (n_ratings,)
an array which contains the binary ground truth labels in {0,1}
- y_predarray-like of shape (n_ratings,)
an array which contains the binary predicted labels in {0,1}
- scoresarray-like of shape (n_ratings,)
an array which contains the predicted scores
- ground_trutharray-like of shape (n_ratings,)
an array which contains the ground truth labels in {-1,0,1}
- predictionsarray-like of shape (n_ratings,)
an array which contains the predicted labels in {-1,0,1}
- aucslist
list of AUCs per user
- fscoreslist
list of F-scores per user
- tprsarray-like of shape (n_thresholds,)
Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i], where thresholds is returned by sklearn.metrics.roc_curve
- recsarray-like of shape (n_thresholds,)
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0, where thresholds is returned by sklearn.metrics.precision_recall_curve
- figsizetuple of size 2
width and height of the figure
- model_namestr
model which predicted the ratings
Returns
- metricspandas.DataFrame of shape (2, 2)
table of metrics: AUC, F_beta score in rows, average and standard deviation across users in columns
- plots_argsdict
dictionary of arguments to feed to the plot_metrics function