stanscofi package

Submodules

stanscofi.datasets module

class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')

Bases: object

A class used to encode a drug repurposing dataset (items are drugs, users are diseases)

…

Parameters

ratingsarray-like of shape (n_items, n_users): an array which contains values in {-1, 0, 1, np.nan} describing the negative, unlabelled, positive, unavailable user-item matchings
itemsarray-like of shape (n_item_features, n_items): an array which contains the item feature vectors
usersarray-like of shape (n_user_features, n_users): an array which contains the user feature vectors
same_item_user_featuresbool (default: False): whether the item and user features are the same (optional)
namestr: name of the dataset (optional)

Attributes

namestr: name of the dataset
ratingsCOO-array of shape (n_items, n_users): an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
foldsCOO-array of shape (n_items, n_users): an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
itemsCOO-array of shape (n_item_features, n_items): an array which contains the item feature vectors (NaN for missing features)
usersCOO-array of shape (n_user_features, n_users): an array which contains the user feature vectors (NaN for missing features)
item_listlist of str: a list of the item names in the order of row indices in ratings
user_listlist of str: a list of the user names in the order of column indices in ratings
item_featureslist of str: a list of the item feature names in the order of row indices in items
user_featureslist of str: a list of the user feature names in the order of row indices in users
same_item_user_featuresbool: whether the item and user features are the same
nusersint: number of users
nitemsint: number of items
nuser_featuresint: number of user features
nitem_featuresint: number of item features
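
To illustrate how the ratings and folds attributes relate to the ratings parameter, here is a minimal sketch using dense NumPy arrays rather than COO-arrays. The helper name split_ratings_and_folds is hypothetical and is not part of stanscofi; it only mirrors the encoding described above (NaN marks an unavailable pair).

```python
import numpy as np

def split_ratings_and_folds(raw_ratings):
    """Hypothetical helper (not part of stanscofi).

    raw_ratings holds values in {-1, 0, 1, np.nan}, shape (n_items, n_users).
    Returns (ratings, folds) as dense integer arrays.
    """
    raw = np.asarray(raw_ratings, dtype=float)
    available = ~np.isnan(raw)                 # True where the pair is available
    folds = available.astype(int)              # values in {0, 1}
    ratings = np.where(available, raw, 0).astype(int)  # unavailable pairs become 0
    return ratings, folds

raw = np.array([[1, np.nan, 0],
                [-1, 0, np.nan]])
ratings, folds = split_ratings_and_folds(raw)
```

In this sketch an unlabelled pair (0) and an unavailable pair (NaN) both end up as 0 in ratings, and only folds tells them apart.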

Methods

__init__(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset'): Initializes the Dataset object and creates all attributes
summary(sep='-'*70): Prints out the characteristics of the drug repurposing dataset
visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False): Plots datapoints in the dataset annotated by the ground truth or predicted ratings
subset(folds, subset_name='subset'): Creates a subset of the dataset based on the folds given as input

subset(folds, subset_name='subset')

Obtains a subset of a stanscofi.Dataset based on a set of user and item indices

…

Parameters

foldsCOO-array of shape (n_items, n_users): an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings
subset_namestr: name of the newly created stanscofi.Dataset

Returns

subsetstanscofi.Dataset: dataset corresponding to the folds in input

summary(sep='----------------------------------------------------------------------')

Prints out a summary of the contents of a stanscofi.Dataset: the number of items, users, the number of positive, negative, unlabeled, unavailable matchings, the sparsity number, and the shape and percentage of missing values in the item and user feature matrices

…

Parameters

sepstr: separator for pretty printing

…

Returns

ndrugsint: number of drugs
ndiseasesint: number of diseases
ndrugs_knownint: number of drugs with at least one known (positive or negative) rating
ndiseases_knownint: number of diseases with at least one known (positive or negative) rating
npositiveint: number of positive ratings
nnegativeint: number of negative ratings
nunlabeled_unavailableint: number of unlabeled or unavailable ratings
nunavailableint: number of unavailable ratings
sparsityfloat: percentage of known ratings
sparsity_knownfloat: percentage of known ratings among drugs and diseases with at least one known rating
ndrug_featuresint: number of drug features
missing_drug_featuresfloat: percentage of missing drug feature values
ndisease_featuresint: number of disease features
missing_disease_featuresfloat: percentage of missing disease feature values
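
The counts and percentages listed above can be approximated from a dense ratings matrix as in the sketch below. The function name rating_summary_sketch and the exact formulas (in particular the denominators used for the two sparsity values) are assumptions made for illustration; the library may compute them differently.

```python
import numpy as np

def rating_summary_sketch(ratings):
    """Hypothetical helper: counts and sparsity from a dense ratings matrix
    of shape (n_items, n_users) with values in {-1, 0, 1}."""
    r = np.asarray(ratings)
    known = r != 0                                   # positive or negative ratings
    nknown = int(known.sum())
    ndrugs_known = int(known.any(axis=1).sum())      # rows are items (drugs)
    ndiseases_known = int(known.any(axis=0).sum())   # columns are users (diseases)
    denom_known = ndrugs_known * ndiseases_known
    return {
        "npositive": int((r == 1).sum()),
        "nnegative": int((r == -1).sum()),
        "ndrugs_known": ndrugs_known,
        "ndiseases_known": ndiseases_known,
        # Assumed definition: percentage of known ratings among all pairs
        "sparsity": 100.0 * nknown / r.size,
        # Assumed definition: same, restricted to rows/columns with a known rating
        "sparsity_known": 100.0 * nknown / denom_known if denom_known else 0.0,
    }

stats = rating_summary_sketch([[1, 0, 0],
                               [-1, 1, 0],
                               [0, 0, 0]])
```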

visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots a representation of the datapoints in a stanscofi.Dataset which is annotated either by the ground truth labels or the predicted labels. The representation is the plot of the datapoints according to the first two Principal Components, or the first two dimensions in UMAP, if the feature matrices can be converted into a (n_ratings, n_features) shaped matrix where n_features>1, else it plots a heatmap with the values in the matrix for each rating pair.

In the legend, ground truth labels are denoted with brackets: e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by “pos” (positive) and “neg” (negative); correct (resp., incorrect) predictions are denoted by “correct”, resp. “error”

…

Parameters

withzerosbool: boolean to assess whether (user, item) unknown matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted
Xarray-like of shape (n_ratings, n_features) or None: (item, user) pair feature matrix
yarray-like of shape (n_ratings, ) or None: response vector for each (item, user) pair in X; necessarily X should not be None if y is not None, and vice versa; setting X and y automatically overrides the other behaviors of this function
metricstr: metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
figsizetuple of size 2: width and height of the figure
fontsizeint: size of the legend, title and labels of the figure
dimred_argsdict: dictionary which lists the parameters to the dimensionality reduction method (either PCA, by default, or UMAP, if parameter “n_neighbors” is provided)
predictionsarray-like of shape (n_ratings, 3) or None: a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints, otherwise, the predicted ratings will be used
use_ratingsbool: if set to True, use the ratings in the dataset as predictions (for debugging purposes)
random_stateint: random seed
show_errorsbool: boolean to assess whether to color according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to the identity between the ground truth and the predicted labels for each (user, item) pair
verbosebool: prints out information at each step

stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)

Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.

Each of the nfeatures features for (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those for negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated, which are the concatenation of npositive positive and nnegative negative pair feature vectors generated from Gaussian distributions. Thus there are npositive^2 positive ratings (each “positive” user with a “positive” item), nnegative^2 negative ratings (idem), and the remainder is unknown (that is, (npositive+nnegative)^2-npositive^2-nnegative^2 ratings).

…

Parameters

npositiveint: number of positive items/users
nnegativeint: number of negative items/users
nfeaturesint: number of item/user features
meanfloat: mean of generating Gaussian distributions
stdfloat: standard deviation of generating Gaussian distributions

Returns

ratingsarray-like of shape (n_items, n_users): a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings
usersarray-like of shape (n_user_features, n_users): an array which contains the user feature vectors
itemsarray-like of shape (n_item_features, n_items): an array which contains the item feature vectors
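
The generation scheme described for generate_dummy_dataset can be sketched with NumPy as follows. This is not the library's implementation: the function name dummy_dataset_sketch is hypothetical, and numpy.random.default_rng is an assumption, so the values will not match those produced by stanscofi for the same seed.

```python
import numpy as np

def dummy_dataset_sketch(npositive, nnegative, nfeatures, mean, std, random_state=12454):
    """Hypothetical re-creation of the scheme described above."""
    rng = np.random.default_rng(random_state)
    half = nfeatures // 2
    n = npositive + nnegative

    def feature_matrix():
        # "Positive" columns centered on +mean, "negative" columns on -mean
        pos = rng.normal(mean, std, size=(half, npositive))
        neg = rng.normal(-mean, std, size=(half, nnegative))
        return np.concatenate([pos, neg], axis=1)    # shape (half, n)

    items, users = feature_matrix(), feature_matrix()
    ratings = np.zeros((n, n), dtype=int)            # unknown by default
    ratings[:npositive, :npositive] = 1              # positive item x positive user
    ratings[npositive:, npositive:] = -1             # negative item x negative user
    return ratings, users, items

ratings, users, items = dummy_dataset_sketch(3, 2, 6, mean=2.0, std=0.1)
```

With npositive=3 and nnegative=2, this gives 9 positive, 4 negative and 12 unknown ratings, which matches the counts stated in the description.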

stanscofi.models module

class stanscofi.models.BasicModel(params)

Bases: object

A class used to encode a drug repurposing model

…

Parameters

paramsdict: dictionary which contains method-wise parameters

Attributes

namestr: the name of the model
modeldepends on the implemented method: may contain an instance of a class of sklearn classifiers

…: other attributes might be present depending on the type of model

Methods

__init__(params): Initializes the model with preselected parameters
fit(train_dataset, seed=1234): Preprocesses and fits the model
predict_proba(test_dataset): Outputs properly formatted predictions of the fitted model on test_dataset
predict(scores): Applies the following decision rule: if score<threshold, then return the negative label, otherwise return the positive label
recommend_k_pairs(dataset, k=1, threshold=None): Outputs the top-k (item, user) candidates (or candidates whose score is higher than a threshold) in the input dataset
print_scores(scores): Prints out information about scores
print_classification(predictions): Prints out information about predicted labels
preprocessing(train_dataset) [not implemented in BasicModel]: Preprocesses the input dataset into an input to self.model_fit, if it exists
model_fit(train_dataset) [not implemented in BasicModel]: Fits the model on train_dataset
model_predict_proba(test_dataset) [not implemented in BasicModel]: Outputs predictions of the fitted model on test_dataset

fit(train_dataset, seed=1234)

Fitting the model on the training dataset.

Not implemented in the BasicModel class.

…

Parameters

train_datasetstanscofi.Dataset: training dataset on which the model should fit
seedint (default: 1234): random seed

model_fit()

Fitting the model on the training dataset.

…

Parameters

……: appropriate inputs to the classifier (vary across algorithms)

model_predict_proba()

Making predictions using the model on the testing dataset.

…

Parameters

……: appropriate inputs to the classifier (vary across algorithms)

…

Returns

scoresarray-like of shape (n_items, n_users): prediction values by the model

predict(scores, threshold=0.5)

Outputs class labels based on the scores, using the following formula: prediction = -1 if (score<threshold) else 1

…

Parameters

scoresCOO-array of shape (n_items, n_users): sparse matrix in COOrdinate format
thresholdfloat: the threshold of classification into the positive class

Returns

predictionsCOO-array of shape (n_items, n_users): sparse matrix in COOrdinate format with values in {-1,1}
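
The decision rule stated for predict can be illustrated on a plain list of scores. The function name threshold_scores is hypothetical, and the sketch ignores the COO sparse format used by the library.

```python
def threshold_scores(scores, threshold=0.5):
    """Hypothetical helper applying: prediction = -1 if score < threshold else 1."""
    return [-1 if score < threshold else 1 for score in scores]

predictions = threshold_scores([0.1, 0.5, 0.9])
```

Note that a score exactly equal to the threshold is assigned to the positive class under this rule.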

predict_proba(test_dataset, default_zero_val=1e-31)

Outputs properly formatted scores (not necessarily in [0,1]!) from the fitted model on test_dataset. Internally calls model_predict_proba() then reformats the scores

…

Parameters

test_datasetstanscofi.Dataset: dataset on which predictions should be made

Returns

scoresCOO-array of shape (n_items, n_users): sparse matrix in COOrdinate format, with nonzero values corresponding to predictions on available pairs in the dataset

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the classifier (e.g., X feature matrix for each (user, item) pair, y response vector).

…

Parameters

datasetstanscofi.Dataset: dataset to convert
is_trainingbool: is the preprocessing prior to training (true) or testing (false)?

Returns

……: appropriate inputs to the classifier (vary across algorithms)

print_classification(predictions)

Prints out information about the predicted classes

…

Parameters

predictionsCOO-array: sparse matrix in COOrdinate format

print_scores(scores)

Prints out information about the scores

…

Parameters

scoresCOO-array: sparse matrix in COOrdinate format

recommend_k_pairs(dataset, k=1, threshold=None)

Outputs the top-k (item, user) candidates (or candidates whose score is higher than a threshold) in the input dataset

…

Parameters

datasetstanscofi.Dataset: dataset on which predictions should be made
kint or None (default: 1): number of pair candidates to return (with ties)
thresholdfloat or None (default: None): threshold on candidate scores. If k is not None, k best candidates are returned independently of the value of threshold

…

Returns

candidateslist of tuples of size 3: list of (item, user, score) candidates (by name as present in the dataset)
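
The interplay between k and threshold described above can be sketched on a list of (item, user, score) tuples. The function name top_k_with_ties is hypothetical; how the library breaks ties and whether the threshold comparison is strict are assumptions here (the sketch keeps every candidate tied with the k-th best score, and uses a strict comparison for the threshold).

```python
def top_k_with_ties(candidates, k=1, threshold=None):
    """Hypothetical helper: candidates is a list of (item, user, score)."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    if k is not None:
        # k takes precedence over threshold
        if k <= 0 or not ranked:
            return []
        if k >= len(ranked):
            return ranked
        cutoff = ranked[k - 1][2]
        return [c for c in ranked if c[2] >= cutoff]   # keep ties with the k-th score
    if threshold is not None:
        return [c for c in ranked if c[2] > threshold]
    return ranked

pairs = [("d1", "u1", 0.9), ("d2", "u1", 0.7), ("d3", "u2", 0.7), ("d4", "u2", 0.1)]
top2 = top_k_with_ties(pairs, k=2)
above = top_k_with_ties(pairs, k=None, threshold=0.5)
```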

class stanscofi.models.LogisticRegression(params)

Bases: BasicModel

Logistic Regression (calls sklearn.linear_model.LogisticRegression internally). It uses the very same parameters as sklearn.linear_model.LogisticRegression, so please refer to help(sklearn.linear_model.LogisticRegression).

…

Parameters

paramsdict: dictionary which contains sklearn.linear_model.LogisticRegression parameters, plus a key called “preprocessing” which determines which preprocessing function (in stanscofi.preprocessing) should be applied to data, plus a key called “subset” which gives the maximum number of features to consider in the model (those features will be the Top-subset in terms of variance across samples)

Attributes

Same as BasicModel class

Methods

Same as BasicModel class

preprocessing(train_dataset): Preprocesses the input dataset into something that is an input to fit
model_fit(train_dataset): Preprocesses and fits the model
model_predict_proba(test_dataset): Outputs predictions of the fitted model on test_dataset

model_fit(X, y)

Fitting the Logistic Regression model on the training dataset.

…

Parameters

Xarray-like of shape (n_ratings, n_pair_features): (user, item) feature matrix (the actual contents of the matrix depend on parameters “preprocessing” and “subset” given as input)
yarray-like of shape (n_ratings, ): response vector for each (user, item) pair

model_predict_proba(X)

Making predictions using the Logistic Regression model on the testing dataset.

…

Parameters

Xarray-like of shape (n_ratings, n_pair_features): (user, item) feature matrix (the actual contents of the matrix depend on parameters “preprocessing” and “subset” given as input)

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the Logistic Regression classifier.

…

Parameters

datasetstanscofi.Dataset: dataset to convert
is_trainingbool: is the preprocessing prior to training (true) or testing (false)?

Returns

args: contains
Xarray-like of shape (n_ratings, n_pair_features): (user, item) feature matrix (the actual contents of the matrix depend on parameters “preprocessing” and “subset” given as input)
yarray-like of shape (n_ratings, ): response vector for each (user, item) pair

class stanscofi.models.NMF(params)

Bases: BasicModel

Non-negative Matrix Factorization (calls sklearn.decomposition.NMF internally). It uses the very same parameters as sklearn.decomposition.NMF, so please refer to help(sklearn.decomposition.NMF).

…

Parameters

paramsdict: dictionary which contains sklearn.decomposition.NMF parameters

Attributes

Same as BasicModel class

Methods

Same as BasicModel class

preprocessing(train_dataset): Preprocesses the input dataset into something that is an input to fit
model_fit(train_dataset): Preprocesses and fits the model
model_predict_proba(test_dataset): Outputs predictions of the fitted model on test_dataset

model_fit(input)

Fitting the NMF model on the preprocessed training dataset.

…

Parameters

inputarray-like of shape (n_samples,n_features): training data

model_predict_proba(input)

Making predictions using the NMF model on the testing dataset.

…

Parameters

inputarray-like of shape (n_samples,n_features): testing data

…

Returns

result : array-like of shape (n_samples,n_features)

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the NMF classifier.

…

Parameters

datasetstanscofi.Dataset: dataset to convert
is_trainingbool: is the preprocessing prior to training (true) or testing (false)?

Returns

args: contains
Aarray-like of shape (n_users, n_items): the transposed translated association matrix, such that all its values are non-negative
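
One way to obtain a transposed, non-negative association matrix from ratings in {-1, 0, 1} is shown below. The translation used here (subtracting the minimum value) and the function name nonnegative_transposed are assumptions; the documentation does not state which constant the library adds.

```python
import numpy as np

def nonnegative_transposed(ratings):
    """Hypothetical helper: shift ratings so the minimum is 0, then transpose
    from (n_items, n_users) to (n_users, n_items)."""
    r = np.asarray(ratings, dtype=float)
    return (r - r.min()).T

A = nonnegative_transposed([[1, 0, -1],
                            [0, 1, 0]])
```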

stanscofi.preprocessing module

class stanscofi.preprocessing.CustomScaler(posinf, neginf)

Bases: object

A class used to encode a simple preprocessing pipeline for feature matrices. Does mean imputation for features, feature filtering, correction of infinity errors and standardization

…

Parameters

posinfint: Value to replace infinity (positive) values
neginfint: Value to replace infinity (negative) values

Attributes

imputerNone or sklearn.impute.SimpleImputer instance: Class for imputation of values
scalerNone or sklearn.preprocessing.StandardScaler: Class for standardization of values
filterNone or list: List of selected features (Top-N in terms of variance)

Methods

__init__(posinf, neginf): Initializes the scaler (with unfitted attributes)
fit_transform(mat, subset=None, verbose=False): Fits classes and transforms a matrix

fit_transform(mat, subset=None, verbose=False)

Fits each attribute of the scaler and transforms a feature matrix. Does mean imputation for features, feature filtering, correction of infinity errors and standardization

…

Parameters

matarray-like of shape (n_samples, n_features): matrix which should be preprocessed
subsetNone or int: number of features to keep in feature matrix (Top-N in variance); if it is None, attribute filter is either initialized (if it is equal to None) or used to filter features
verbosebool: prints out information

Returns

mat_nanarray-like of shape (n_samples, n_features): Preprocessed matrix
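
The four steps named above (correction of infinities, mean imputation, variance-based feature filtering, standardization) can be sketched with NumPy alone. The function name scale_sketch and the order of the steps are assumptions; the library relies on sklearn.impute.SimpleImputer and sklearn.preprocessing.StandardScaler, which this sketch does not use.

```python
import numpy as np

def scale_sketch(mat, subset=None, posinf=10, neginf=-10):
    """Hypothetical pipeline on a (n_samples, n_features) matrix."""
    m = np.array(mat, dtype=float)
    # 1. Replace infinite values by finite placeholders
    m[np.isposinf(m)] = posinf
    m[np.isneginf(m)] = neginf
    # 2. Mean imputation per feature (column); assumes no column is entirely NaN
    col_means = np.nanmean(m, axis=0)
    rows, cols = np.where(np.isnan(m))
    m[rows, cols] = col_means[cols]
    # 3. Keep the `subset` features with the highest variance (original order kept)
    if subset is not None:
        keep = np.sort(np.argsort(m.var(axis=0))[::-1][:subset])
        m = m[:, keep]
    # 4. Standardize each feature to zero mean and unit variance
    std = m.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant features
    return (m - m.mean(axis=0)) / std

out = scale_sketch([[1, np.nan, 5],
                    [3, 2, np.inf],
                    [5, 4, 5]], subset=2)
```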

stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)

Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Instead of concatenating item and user features for a given pair, resulting in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices), compute a single score per pair of (item_matrix, user_matrix) for each (item, user) pair, resulting in a vector of size (n_item_matrices) x (n_user_matrices).

The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i’) x fu(u’, u)) | (i’,u’)!=(i,u), fi(i, i’)!=NaN, fu(u’, u)!=NaN, rating(i’,u’)!=0 }

Then the final feature matrix is X = (X_{i,j})_{i,j} for i a (item, user) pair and j a (item similarity, user similarity) pair
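
For a single pair of similarity matrices, the score defined above can be computed by brute force as in this sketch. The function name perlman_score is hypothetical, the matrices are plain nested lists, and similarities are assumed to be non-negative so that the square root is defined; the library's implementation is parallelized and is not reproduced here.

```python
import math

def perlman_score(fi, fu, ratings, i, u):
    """Hypothetical helper: max of sqrt(fi[i][i2] * fu[u2][u]) over all known
    pairs (i2, u2) different from (i, u), skipping NaN similarities.
    Returns None when no admissible pair exists."""
    best = None
    for i2, row in enumerate(ratings):
        for u2, rating in enumerate(row):
            if (i2, u2) == (i, u) or rating == 0:
                continue                        # skip the pair itself and unknown ratings
            a, b = fi[i][i2], fu[u2][u]
            if math.isnan(a) or math.isnan(b):
                continue                        # skip missing similarities
            value = math.sqrt(a * b)
            best = value if best is None else max(best, value)
    return best

fi = [[1.0, 0.5], [0.5, 1.0]]      # item-item similarity
fu = [[1.0, 0.8], [0.8, 1.0]]      # user-user similarity
ratings = [[0, 1], [-1, 0]]        # known pairs: (0, 1) and (1, 0)
score = perlman_score(fi, fu, ratings, 0, 0)
```

Here the two admissible pairs contribute sqrt(1.0 x 0.8) and sqrt(0.5 x 1.0), so the score for pair (0, 0) is sqrt(0.8).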

…

Parameters

datasetstanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) with the following attributes

ratingsCOO-array of shape (n_items, n_users): an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
itemsCOO-array of shape (n_item_features, n_items): concatenation of n_drug_features drug similarity matrices of shape (n_drugs, n_drugs), where values in item_features are denoted by “<feature><sep_feature><drug>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that items is a single similarity matrix (n_item_matrices=1)
usersCOO-array of shape (n_user_features, n_users): concatenation of n_disease_features disease similarity matrices of shape (n_diseases, n_diseases), where values in user_features are denoted by “<feature><sep_feature><disease>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that users is a single similarity matrix (n_user_matrices=1)

NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).

njobsint: number of jobs to run in parallel
sep_featurestr or None: separator between feature type and element in the feature matrices in dataset; None if a single feature type is expected
missingint: placeholder value that should be different from any feature name
infint: value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)
verbosebool: prints out information

Returns

Xarray-like of shape (n_items x n_users, n_item_features x n_user_features): the feature matrix
yarray-like of shape (n_items x n_users, ): the response/outcome vector

stanscofi.preprocessing.cartesian_product_transpose(*arrays)

stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)

Computes a single feature matrix and response vector from a drug repurposing dataset, by imputation by the average value of a feature for missing values and by centering and standardizing user and item feature matrices and concatenating them

…

Parameters

datasetstanscofi.Dataset: dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) where missing values are denoted by numpy.nan
subsetNone or int: number of features to keep in item feature matrix, and in user feature matrix (selecting the ones with highest variance)
scalerSNone or sklearn.preprocessing.StandardScaler instance: scaler for items
scalerPNone or sklearn.preprocessing.StandardScaler instance: scaler for users
verbosebool: prints out information

Returns

Xarray-like of shape (n_folds, n_item_features+n_user_features): the feature matrix
yarray-like of shape (n_folds, ): the response/outcome vector
scalerSNone or stanscofi.preprocessing.CustomScaler instance: scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
scalerPNone or stanscofi.preprocessing.CustomScaler instance: scaler for users; if the input value was None, returns the scaler fitted on user feature vectors

stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)

Converts a dataset into a feature matrix X and a response vector y, using the preprocessing approach named by preprocessing_str

…

Parameters

datasetstanscofi.datasets.Dataset: dataset to preprocess
preprocessing_strstr: type of preprocessing, in [“Perlman_procedure”, “meanimputation_standardize”, “same_feature_preprocessing”]
subset_None or int: number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance). Warning: the selection is made across user and item features together if preprocessing_str!=“meanimputation_standardize”; otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)
operatorNone or str: arithmetic operation to apply, ex. “+”, “*”
sep_featurestr: separator between feature type and element in the feature matrices in dataset
filter_None or list: list of feature indices to keep (of length subset_) (overrides the subset_ parameter if both are fed)
scalerSNone or stanscofi.preprocessing.CustomScaler instance: scaler for items; the scaler fitted on item feature vectors
scalerPNone or stanscofi.preprocessing.CustomScaler instance: scaler for users; the scaler fitted on user feature vectors
inffloat or int: placeholder value for infinity values (positive : +inf, negative : -inf)
njobsint: number of jobs to run in parallel (njobs > 0) for the Perlman procedure

Returns

Xarray-like of shape (n_folds, n_features): the feature matrix
yarray-like of shape (n_folds, ): the response/outcome vector
scalerSNone or stanscofi.preprocessing.CustomScaler instance: scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
scalerPNone or stanscofi.preprocessing.CustomScaler instance: scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
filter_None or list: list of feature indices to keep (of length subset_)

stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)

If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors coefficient per coefficient.

…

Parameters

datasetstanscofi.Dataset: dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True
operatorstr: arithmetic operation to apply, ex. “+”, “*”

Returns

Xarray-like of shape (n_folds, n_features): the feature matrix
yarray-like of shape (n_folds, ): the response/outcome vector
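
The coefficient-per-coefficient combination described above amounts to the following, shown for a single (item, user) pair. The function name combine_features and the OPERATORS table are hypothetical; only “+” and “*” are covered in this sketch.

```python
import operator

# Hypothetical mapping from operator strings to Python functions
OPERATORS = {"+": operator.add, "*": operator.mul}

def combine_features(item_vector, user_vector, op="*"):
    """Hypothetical helper: element-wise combination of two feature vectors
    that share the same features (same length, same order)."""
    if len(item_vector) != len(user_vector):
        raise ValueError("item and user vectors must have the same length")
    func = OPERATORS[op]
    return [func(a, b) for a, b in zip(item_vector, user_vector)]

product = combine_features([1, 2, 3], [4, 5, 6], "*")
total = combine_features([1, 2, 3], [4, 5, 6], "+")
```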

stanscofi.training_testing module

stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Trains a model on a dataset with cross-validation (using sklearn.model_selection.StratifiedKFold) and custom metrics

…

Parameters

templatestanscofi.BasicModel or subclass: type of model to train
paramsdict: dictionary of parameters to initialize the model
train_datasetstanscofi.Dataset: dataset to train upon
nsplitsint: number of cross-validation steps
metricstr: metric to optimize the model upon. Implemented metrics are in validation.py
kint (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
betafloat (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
thresholdfloat (default: 0): decision threshold
test_sizefloat (default: 0.2): percentage of testing set (if cv_type="weakly_correlated")
dist_typestr (default: “cosine”): type of metric for splitting (if cv_type=”weakly_correlated”)
cv_typestr (default: “random”): type of split to apply to the dataset. Can either be “random” or “weakly_correlated”
early_stopint or None: positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore
njobsint (default: 1): number of jobs to run in parallel. Should be lower than nsplits-1
random_stateint (default: 1234): random seed
show_plotsbool (default: False): shows the validation plots at each cross-validation step
verbosebool (default: False): prints out information

Returns

resultsdict

a dictionary which contains

“models”list of subinstances of stanscofi.models.BasicModel of length nsplits: all trained models
“train_metric”list of floats of length nsplits: all metrics on training sets
“test_metric”list of floats of length nsplits: all metrics on testing sets
“cv_folds”list of COO-array of shape (n_items, n_users) of length nsplits: the training and testing folds for each split

stanscofi.training_testing.grid_search(search_params, template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.

…

Parameters

search_paramsdict: a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search
templatestanscofi.BasicModel or subclass: type of model to train
paramsdict: dictionary of parameters to initialize the model
train_datasetstanscofi.Dataset: dataset to train upon
metricstr: metric to optimize the model upon. Implemented metrics are in validation.py
kint (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
betafloat (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
thresholdfloat (default: 0): decision threshold
test_sizefloat (default: 0.2): percentage of testing set (if cv_type="weakly_correlated")
dist_typestr (default: “cosine”): type of metric for splitting (if cv_type=”weakly_correlated”)
cv_typestr (default: “random”): type of split to apply to the dataset. Can either be “random” or “weakly_correlated”
njobsint (default: 1): number of jobs to run in parallel. Should be lower than nsplits-1
random_stateint (default: 1234): random seed
show_plotsbool (default: False): shows the validation plots at each cross-validation step
verbosebool (default: False): prints out information

Returns

best_paramsdict

a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps

best_modelsubinstance of stanscofi.models.BasicModel

the best trained model associated with the best parameters

metricsdict

a dictionary which contains

“train_metric”float: the metric on the training set on the best cross-validation split for the best set of parameters
“test_metric”float: the metric on the testing set on the best cross-validation split for the best set of parameters
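
Optimizing over one hyperparameter at a time, as described for grid_search, can be sketched as below. The function name coordinate_grid_search and the evaluate callback are hypothetical: evaluate stands in for a call to cv_training followed by extraction of a single score, and the sketch assumes that a higher score is better.

```python
def coordinate_grid_search(search_params, params, evaluate):
    """Hypothetical helper: sweep each hyperparameter in turn, keeping the
    best value found so far before moving to the next hyperparameter."""
    best_params = dict(params)
    best_score = evaluate(best_params)
    for name, values in search_params.items():
        for value in values:
            trial = dict(best_params, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_params, best_score = trial, score
    return best_params, best_score

# Toy objective with its maximum (0) at a=2, b=3
def toy_metric(p):
    return -(p["a"] - 2) ** 2 - (p["b"] - 3) ** 2

best_params, best_score = coordinate_grid_search(
    {"a": [0, 1, 2, 3], "b": [1, 2, 3, 4]}, {"a": 0, "b": 1}, toy_metric
)
```

Unlike an exhaustive grid search, this strategy evaluates the sum rather than the product of the grid sizes, at the cost of possibly missing interactions between hyperparameters.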

stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)

Converts indices of datapoints into folds as defined in stanscofi

…

Parameters

indicesarray-like of size (n_selected_ratings, ): flat indices of selected datapoints
indices_arrayarray-like of size (n_total_ratings, 2): corresponding row and column indices of datapoints
shapetuple of integers of size 2: total numbers of rows and columns

Returns

foldsCOO-array of shape shape: folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
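
A dense equivalent of this conversion is sketched below. It assumes that each entry of indices is a position in indices_array, which in turn gives the (row, column) coordinates of that datapoint; the function name indices_to_dense_folds is hypothetical and the result is a nested list rather than a COO-array.

```python
def indices_to_dense_folds(indices, indices_array, shape):
    """Hypothetical helper: build a 0/1 matrix of the given shape with a 1
    at the (row, column) position of every selected datapoint."""
    nrows, ncols = shape
    folds = [[0] * ncols for _ in range(nrows)]
    for flat in indices:
        row, col = indices_array[flat]
        folds[row][col] = 1
    return folds

indices_array = [(0, 0), (0, 2), (1, 1)]   # coordinates of all datapoints
dense_folds = indices_to_dense_folds([0, 2], indices_array, (2, 3))
```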

stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')

Splits the data into training and testing datasets randomly for cross-validation.

…

Parameters

datasetstanscofi.Dataset: dataset to split
cv_generatorscikit-learn cross-validation index generator: e.g. StratifiedKFold, KFold
metricstr: metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

Returns

cv_foldslist of size nsplits of COO-array of shape (n_items, n_users): list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_lstlist of size nsplits of tuples of float of size 3: for each fold, minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set

stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)

Splits the data into training and testing datasets randomly.

…

Parameters

datasetstanscofi.Dataset: dataset to split
test_sizefloat: value between 0 and 1 (strictly) which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset
metricstr: metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
random_stateint: random seed

Returns

cv_foldslist of COO-array of shape (n_items, n_users): list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_train_test, dist_train, dist_testfloat: minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
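
The three returned distances can be read as follows. This is a minimal sketch on toy feature vectors, assuming one row per element; which feature vectors stanscofi actually compares is not stated here, and min_nonzero_distance is a hypothetical helper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def min_nonzero_distance(A, B, metric="cosine"):
    """Smallest strictly positive pairwise distance between rows of A and rows of B."""
    d = cdist(A, B, metric=metric)
    d = d[d > 0]  # discard zero distances (identical elements)
    return float(d.min()) if d.size else float("nan")

train = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy training elements
test = np.array([[1.0, 1.0], [2.0, 0.0]])   # toy testing elements

dist_train_test = min_nonzero_distance(train, test, metric="euclidean")
dist_train = min_nonzero_distance(train, train, metric="euclidean")
dist_test = min_nonzero_distance(test, test, metric="euclidean")
```

A small dist_train_test relative to dist_train and dist_test suggests that some testing elements are very close to training elements.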

stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop=None, metric='cosine', random_state=1234, niter=100, verbose=False)

Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.

…

Parameters

datasetstanscofi.Dataset: dataset to split
test_sizefloat: value between 0 and 1 (strictly) which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset
early_stopint or None: positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore
metricstr: metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
random_stateint: random seed
niterint: maximum number of iterations of the clustering loop
verbosebool: prints out information

Returns

train_folds, test_foldsCOO-array of shape (n_items, n_users): an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_train_test, dist_train, dist_testfloat: minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
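
The idea of splitting along clusters of items can be sketched as below. This is a simplified illustration, not stanscofi's algorithm: it fixes the number of clusters through a hypothetical n_clusters argument instead of searching for it, uses average linkage as an arbitrary choice, and bounds the share of items (rather than of ratings) sent to the testing set.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_split_sketch(item_features, test_size, n_clusters, metric="cosine"):
    # item_features has shape (n_item_features, n_items); NaNs are replaced by 0.
    X = np.nan_to_num(np.asarray(item_features, dtype=float)).T
    Z = linkage(X, method="average", metric=metric)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    budget = int(test_size * X.shape[0])  # maximum number of testing items
    test_items = []
    # Move whole clusters (smallest first) to the testing set while they fit.
    for lab in sorted(set(labels.tolist()), key=lambda l: int(np.sum(labels == l))):
        members = np.flatnonzero(labels == lab).tolist()
        if len(test_items) + len(members) <= budget:
            test_items.extend(members)
    test_items = sorted(test_items)
    train_items = [i for i in range(X.shape[0]) if i not in test_items]
    return train_items, test_items, labels

item_features = np.array([[1.0, 1.0, 0.9, 0.0, 0.1, np.nan],
                          [0.0, 0.1, 0.0, 1.0, 1.0, 1.0]])
train_items, test_items, labels = cluster_split_sketch(
    item_features, test_size=0.5, n_clusters=2)
```

Because whole clusters are assigned to one side, items in the testing set are dissimilar from those in the training set under the chosen metric. Note that the cosine metric is undefined for all-zero feature vectors, which can appear once NaNs are replaced by 0.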

stanscofi.utils module

stanscofi.utils.compute_sparsity(df)

Computes the sparsity number of a collaborative filtering dataset

…

Parameters

dfpandas.DataFrame of shape (n_items, n_users): the matrix of ratings where unknown matchings are denoted with 0

Returns

sparsityfloat: the percentage of non-missing values in the matrix of ratings
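
Taking the description above literally, the quantity can be computed as in the following sketch. The 0 to 100 scale is an assumption based on the word "percentage"; this is not the library's code.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix of shape (n_items, n_users); 0 denotes an unknown matching.
df = pd.DataFrame([[1, 0, -1], [0, 0, 1]],
                  index=["drug1", "drug2"], columns=["dis1", "dis2", "dis3"])

# Assumption: percentage expressed on a 0-100 scale.
known_pct = 100.0 * np.count_nonzero(df.values) / df.size
```

Three of the six entries are non-zero in this toy matrix, so known_pct is 50.0.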

stanscofi.utils.load_dataset(model_name, save_folder='./', sep_feature='-')

Loads a drug repurposing dataset

…

Parameters

model_namestr: the name of the dataset to load. Should belong to the following list: [“Gottlieb”, “DNdataset”, “Cdataset”, “LRSSL”, “PREDICT_Gottlieb”, “TRANSCRIPT”, “PREDICT”]
save_folderstr: the path to the folder where dataset-related files are or will be stored

Returns

dataset_didictionary: a dictionary where key “ratings” contains the drug-disease matching pandas.DataFrame of shape (n_drugs, n_diseases) (where missing values are denoted by 0), key “users” contains the disease feature pandas.DataFrame of shape (n_disease_features, n_diseases), and key “items” contains the drug feature pandas.DataFrame of shape (n_drug_features, n_drugs)

stanscofi.utils.matrix2ratings(df, user_col='user', item_col='item', rating_col='rating')

Converts a matrix into a list of ratings

…

Parameters

dfpandas.DataFrame of shape (n_items, n_users): the matrix of ratings in {-1, 1, 0} where unknown matchings are denoted with 0
user_colstr: column denoting users
item_colstr: column denoting items
rating_colstr: column denoting ratings in {-1, 0, 1}

Returns

ratingspandas.DataFrame of shape (n_ratings, 3): the list of known ratings where the first column corresponds to users, the second to items, and the third to ratings
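
A minimal pandas sketch of this conversion, assuming that rows of the matrix are items, columns are users, and zero entries are dropped as unknown. The function name matrix2ratings_sketch is hypothetical and this is not stanscofi's implementation.

```python
import pandas as pd

def matrix2ratings_sketch(df, user_col="user", item_col="item", rating_col="rating"):
    long = df.stack().reset_index()             # one row per (item, user) pair
    long.columns = [item_col, user_col, rating_col]
    long = long[long[rating_col] != 0]          # keep known ratings only
    return long[[user_col, item_col, rating_col]].reset_index(drop=True)

df = pd.DataFrame([[1, 0], [-1, 1]],
                  index=["drug1", "drug2"], columns=["dis1", "dis2"])
ratings = matrix2ratings_sketch(df)
```

The result has one row per non-zero entry of df, with the columns ordered as users, items, ratings.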

stanscofi.utils.merge_ratings(rating_dfs, user_col, item_col, rating_col)

Merges rating lists from several sources by solving conflicts. Conflicting ratings are resolved as follows: if there is at least one negative rating (-1) reported for a (drug, disease) pair, then the final rating is negative (-1); if there is at least one positive rating (1) and no negative rating (-1) reported, then the final rating is positive (1)

…

Parameters

rating_dfslist of pandas.DataFrame of shape (n_ratings, 3): the list of rating lists where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}
user_colstr: column denoting users
item_colstr: column denoting items
rating_colstr: column denoting ratings in {-1, 0, 1}

verbose : bool

Returns

rating_dfpandas.DataFrame of shape (n_ratings, 3): the merged rating list where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}
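
The conflict-resolution rule stated above can be sketched with a pandas group-by. The names merge_ratings_sketch and resolve are hypothetical, and returning 0 when a pair has neither a negative nor a positive rating is an assumption, since that case is not covered by the description.

```python
import pandas as pd

def resolve(values):
    # Any negative rating wins; otherwise any positive rating wins.
    if (values == -1).any():
        return -1
    if (values == 1).any():
        return 1
    return 0  # assumption: only unknown ratings were reported

def merge_ratings_sketch(rating_dfs, user_col, item_col, rating_col):
    stacked = pd.concat(rating_dfs, ignore_index=True)
    grouped = stacked.groupby([user_col, item_col])[rating_col]
    return grouped.agg(resolve).reset_index()

a = pd.DataFrame({"user": ["u1", "u1"], "item": ["i1", "i2"], "rating": [1, 0]})
b = pd.DataFrame({"user": ["u1", "u1", "u2"], "item": ["i1", "i2", "i1"],
                  "rating": [-1, 1, 1]})
merged = merge_ratings_sketch([a, b], "user", "item", "rating")
```

In this toy example, the pair (u1, i1) receives both 1 and -1 and ends up negative, while (u1, i2) receives 0 and 1 and ends up positive.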

stanscofi.utils.print_dataset(ratings, user_col, item_col, rating_col)

Prints values of a drug repurposing dataset

…

Parameters

ratingspandas.DataFrame of shape (n_ratings, 3): the list of ratings with columns user_col, item_col, rating_col
user_colstr: column denoting users
item_colstr: column denoting items
rating_colstr: column denoting ratings in {-1, 0, 1}

Returns

None

Prints

The number of items/drugs, users/diseases, and the number of positive (1), negative (-1) and unknown (0) matchings.

stanscofi.utils.ratings2matrix(ratings, user_col, item_col, rating_col)

Converts a list of ratings into a matrix

…

Parameters

ratingspandas.DataFrame of shape (n_ratings, 3): the list of known ratings where the first column (user_col) corresponds to users, the second (item_col) to items, and the third (rating_col) to ratings in {-1,0,1}
user_colstr: column denoting users
item_colstr: column denoting items
rating_colstr: column denoting ratings in {-1, 0, 1}

Returns

dfpandas.DataFrame of shape (n_items, n_users): the matrix of ratings in {-1, 1, 0} where unknown matchings are denoted with 0
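
The inverse conversion can be sketched with a pandas pivot, assuming each (item, user) pair appears at most once in the list (pivot raises an error on duplicates). The function name ratings2matrix_sketch is hypothetical and this is not stanscofi's implementation.

```python
import pandas as pd

def ratings2matrix_sketch(ratings, user_col, item_col, rating_col):
    mat = ratings.pivot(index=item_col, columns=user_col, values=rating_col)
    return mat.fillna(0).astype(int)  # pairs absent from the list become 0

ratings = pd.DataFrame({"user": ["dis1", "dis1", "dis2"],
                        "item": ["drug1", "drug2", "drug2"],
                        "rating": [1, -1, 1]})
df = ratings2matrix_sketch(ratings, "user", "item", "rating")
```

Items end up in rows and users in columns, matching the (n_items, n_users) shape described above.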

stanscofi.validation module

stanscofi.validation.AP(y_true, y_pred, u, u1)

stanscofi.validation.AUC(y_true, y_pred, k, u1)

stanscofi.validation.DCGk(y_true, y_pred, k, u1)

stanscofi.validation.ERR(y_true, y_pred, max=10, max_grade=2)

Source: https://raw.githubusercontent.com/skondo/evaluation_measures/master/evaluations_measures.py

stanscofi.validation.F1K(y_true, y_pred, k, u1)

stanscofi.validation.Fscore(y_true, y_pred, u, beta)

stanscofi.validation.HRk(y_true, y_pred, k, u1)

stanscofi.validation.MAP(y_true, y_pred, u, u1)

stanscofi.validation.MRR(y_true, y_pred, u, u1)

stanscofi.validation.MeanRank(y_true, y_pred, k, u1)

stanscofi.validation.NDCGk(y_true, y_pred, k, u1)

stanscofi.validation.PrecisionK(y_true, y_pred, k, u1)

stanscofi.validation.RP(y_true, y_pred, u, u1)

stanscofi.validation.RecallK(y_true, y_pred, k, u1)

stanscofi.validation.Rscore(y_true, y_pred, u, u1)

stanscofi.validation.TAU(y_true, y_pred, u, u1)

stanscofi.validation.compute_metrics(scores, predictions, dataset, metrics, k=1, beta=1, verbose=False)

Computes user-wise validation metrics for a given set of scores and predictions w.r.t. a dataset

…

Parameters

scoresCOO-array of shape (n_items, n_users): sparse matrix in COOrdinate format
predictionsCOO-array of shape (n_items, n_users): sparse matrix in COOrdinate format with values in {-1,1}
datasetstanscofi.Dataset: dataset on which the metrics should be computed
metricslist of str: list of metrics which should be computed
kint (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
betafloat (default: 1): Argument of the metric to optimize. Implemented metrics are in validation.py
verbosebool: prints out information about users ignored in the computation of validation metrics, that is, users whose pairs are all associated with a single class (i.e., all pairs with this user are assigned 0, or all -1, or all 1)

Returns

metricspandas.DataFrame of shape (len(metrics), 2): table of metrics: metrics in rows, average and standard deviation across users in columns
plots_argsdict: dictionary of arguments to feed to the plot_metrics function to plot the Precision-Recall and the Receiver Operating Characteristic (ROC) curves
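
To illustrate what "user-wise" means here, the sketch below computes one AUC per user (column) on toy data, skips users whose labels belong to a single class, and reports the average and standard deviation. It assumes binary labels and uses NumPy's default (population) standard deviation; how stanscofi maps ratings in {-1, 0, 1} to labels is not described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy (n_items, n_users) arrays: binary ground truth and predicted scores.
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 1, 0]])
scores = np.array([[0.9, 0.5, 0.3],
                   [0.2, 0.6, 0.4],
                   [0.8, 0.7, 0.6]])

aucs = []
for u in range(y_true.shape[1]):
    labels_u = y_true[:, u]
    if len(np.unique(labels_u)) < 2:
        continue  # AUC is undefined when a user has a single class
    aucs.append(roc_auc_score(labels_u, scores[:, u]))

mean_auc, std_auc = float(np.mean(aucs)), float(np.std(aucs))
```

The second user has only positive labels and is therefore ignored, which is the situation the verbose flag reports.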

stanscofi.validation.plot_metrics(y_true=None, y_pred=None, scores=None, ground_truth=None, predictions=None, aucs=None, fscores=None, tprs=None, recs=None, figsize=(16, 5), model_name='Model')

Plots the ROC curve, the Precision-Recall curve, the boxplot of predicted scores and the piechart of classes associated to the predictions y_pred in input w.r.t. ground truth y_true

…

Parameters

y_truearray-like of shape (n_ratings,): an array which contains the binary ground truth labels in {0,1}
y_predarray-like of shape (n_ratings,): an array which contains the binary predicted labels in {0,1}
scoresarray-like of shape (n_ratings,): an array which contains the predicted scores
ground_trutharray-like of shape (n_ratings,): an array which contains the ground truth labels in {-1,0,1}
predictionsarray-like of shape (n_ratings,): an array which contains the predicted labels in {-1,0,1}
aucslist: list of AUCs per user
fscoreslist: list of F-scores per user
tprsarray-like of shape (n_thresholds,): Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i], where thresholds is returned by sklearn.metrics.roc_curve
recsarray-like of shape (n_thresholds,): Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0, where thresholds is returned by sklearn.metrics.precision_recall_curve
figsizetuple of size 2: width and height of the figure
model_namestr: model which predicted the ratings

Returns

metricspandas.DataFrame of shape (2, 2): table of metrics: AUC, F_beta score in rows, average and standard deviation across users in columns
plots_argsdict: dictionary of arguments to feed to the plot_metrics function
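
For reference, arrays with the properties described for tprs and recs can be obtained directly from scikit-learn, as in this small sketch on toy data. How stanscofi aggregates such curves across users before plotting is not covered here.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

y_true = np.array([0, 0, 1, 1])            # binary ground truth labels
scores = np.array([0.1, 0.4, 0.35, 0.8])   # predicted scores

# roc_curve returns increasing false/true positive rates and the thresholds used.
fpr, tpr, roc_thresholds = roc_curve(y_true, scores)

# precision_recall_curve returns decreasing recall values ending at 0.
precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)
```

The tpr array is non-decreasing and ends at 1, while recall is non-increasing and ends at 0.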