stanscofi package

Submodules

stanscofi.datasets module

class stanscofi.datasets.Dataset(ratings=None, users=None, items=None, same_item_user_features=False, name='dataset')

Bases: object

A class used to encode a drug repurposing dataset (items are drugs, users are diseases)

Parameters

ratings : array-like of shape (n_items, n_users)

an array which contains values in {-1, 0, 1, np.nan} describing the negative, unlabelled, positive, unavailable user-item matchings

items : array-like of shape (n_item_features, n_items)

an array which contains the item feature vectors

users : array-like of shape (n_user_features, n_users)

an array which contains the user feature vectors

same_item_user_features : bool (default: False)

whether the item and user features are the same (optional)

name : str

name of the dataset (optional)

Attributes

name : str

name of the dataset

ratings : COO-array of shape (n_items, n_users)

an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings

folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings

items : COO-array of shape (n_item_features, n_items)

an array which contains the item feature vectors (NaN for missing features)

users : COO-array of shape (n_user_features, n_users)

an array which contains the user feature vectors (NaN for missing features)

item_list : list of str

a list of the item names in the order of row indices in ratings

user_list : list of str

a list of the user names in the order of column indices in ratings

item_features : list of str

a list of the item feature names in the order of row indices in items

user_features : list of str

a list of the user feature names in the order of row indices in users

same_item_user_features : bool

whether the item and user features are the same

nusers : int

number of users

nitems : int

number of items

nuser_features : int

number of user features

nitem_features : int

number of item features

Methods

__init__(ratings=None, users=None, items=None, same_item_user_features=False, name="dataset")

Initializes the Dataset object and creates all attributes

summary(sep="-"*70)

Prints out the characteristics of the drug repurposing dataset

visualize(withzeros=False, X=None, y=None, figsize=(5,5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots datapoints in the dataset annotated by the ground truth or predicted ratings

subset(folds, subset_name="subset")

Creates a subset of the dataset based on the folds given as input
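For concreteness, a minimal sketch of building and inspecting a toy Dataset; the shapes follow the conventions documented above, and all names and values are made up:

    import numpy as np
    from stanscofi.datasets import Dataset

    # Toy data: 3 items (drugs) x 2 users (diseases)
    ratings = np.array([[ 1,  0],
                        [ 0, -1],
                        [ 0,  0]])     # (n_items, n_users), values in {-1, 0, 1}
    items = np.random.rand(4, 3)       # (n_item_features, n_items)
    users = np.random.rand(5, 2)       # (n_user_features, n_users)

    dataset = Dataset(ratings=ratings, users=users, items=items, name="toy")
    dataset.summary()                  # prints counts, sparsity, missing values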

subset(folds, subset_name='subset')

Obtains a subset of a stanscofi.Dataset based on a set of user and item indices

Parameters

folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in ratings

subset_name : str

name of the newly created stanscofi.Dataset

Returns

subset : stanscofi.Dataset

dataset corresponding to the folds in input

summary(sep='----------------------------------------------------------------------')

Prints out a summary of the contents of a stanscofi.Dataset: the number of items, users, the number of positive, negative, unlabeled, unavailable matchings, the sparsity number, and the shape and percentage of missing values in the item and user feature matrices

Parameters

sep : str

separator for pretty printing

Returns

ndrugs : int

number of drugs

ndiseases : int

number of diseases

ndrugs_known : int

number of drugs with at least one known (positive or negative) rating

ndiseases_known : int

number of diseases with at least one known (positive or negative) rating

npositive : int

number of positive ratings

nnegative : int

number of negative ratings

nunlabeled_unavailable : int

number of unlabeled or unavailable ratings

nunavailable : int

number of unavailable ratings

sparsity : float

percentage of known ratings

sparsity_known : float

percentage of known ratings among drugs and diseases with at least one known rating

ndrug_features : int

number of drug features

missing_drug_features : float

percentage of missing drug feature values

ndisease_features : int

number of disease features

missing_disease_features : float

percentage of missing disease feature values

visualize(withzeros=False, X=None, y=None, metric='euclidean', figsize=(5, 5), fontsize=20, dimred_args={}, predictions=None, use_ratings=False, random_state=1234, show_errors=False, verbose=False)

Plots a representation of the datapoints in a stanscofi.Dataset which is annotated either by the ground truth labels or the predicted labels. The representation is the plot of the datapoints according to the first two Principal Components, or the first two dimensions in UMAP, if the feature matrices can be converted into a (n_ratings, n_features) shaped matrix where n_features>1, else it plots a heatmap with the values in the matrix for each rating pair.

In the legend, ground truth labels are denoted with brackets: e.g., [0] (unknown), [1] (positive) and [-1] (negative); predicted ratings are denoted by “pos” (positive) and “neg” (negative); correct (resp., incorrect) predictions are denoted by “correct”, resp. “error”

Parameters

withzeros : bool

boolean to assess whether (user, item) unknown matchings should also be plotted; if withzeros=False, then only (item, user) pairs associated with known matchings will be plotted (but the unknown matching datapoints will still be used to compute the dimensionality reduction); otherwise, all pairs will be plotted

X : array-like of shape (n_ratings, n_features) or None

(item, user) pair feature matrix

y : array-like of shape (n_ratings, ) or None

response vector for each (item, user) pair in X; X must not be None if y is not None, and vice versa; setting X and y automatically overrides the other behaviors of this function

metric : str

metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

figsize : tuple of size 2

width and height of the figure

fontsize : int

size of the legend, title and labels of the figure

dimred_args : dict

dictionary which lists the parameters to the dimensionality reduction method (either PCA, by default, or UMAP, if parameter “n_neighbors” is provided)

predictions : array-like of shape (n_ratings, 3) or None

a matrix which contains the user indices (column 1), the item indices (column 2) and the class for the corresponding (user, item) pair (value in {-1, 0, 1} in column 3); if predictions=None, then the ground truth ratings will be used to color datapoints, otherwise, the predicted ratings will be used

use_ratings : bool

if set to True, use the ratings in the dataset as predictions (for debugging purposes)

random_state : int

random seed

show_errors : bool

boolean to assess whether to color according to the error in class prediction; if show_errors=False, then either the ground truth or the predicted class labels will be used to color the datapoints; otherwise, the points will be restricted to the set of known matchings (even if withzeros=True) and colored according to the identity between the ground truth and the predicted labels for each (user, item) pair

verbose : bool

prints out information at each step

stanscofi.datasets.generate_dummy_dataset(npositive, nnegative, nfeatures, mean, std, random_state=12454)

Creates a dummy dataset where the positive and negative (item, user) pairs are arbitrarily similar.

Each of the nfeatures features for (item, user) pair feature vectors associated with positive ratings is drawn from a Gaussian distribution of mean mean and standard deviation std, whereas those for negative ratings are drawn from a Gaussian distribution of mean -mean and standard deviation std. User and item feature matrices of shape (nfeatures//2, npositive+nnegative) are generated, which are the concatenation of npositive positive and nnegative negative pair feature vectors generated from these Gaussian distributions. Thus there are npositive^2 positive ratings (each “positive” user with a “positive” item), nnegative^2 negative ratings (idem), and the remainder is unknown (that is, (npositive+nnegative)^2-npositive^2-nnegative^2 ratings).

Parameters

npositive : int

number of positive items/users

nnegative : int

number of negative items/users

nfeatures : int

number of item/user features

mean : float

mean of generating Gaussian distributions

std : float

standard deviation of generating Gaussian distributions

Returns

ratings : array-like of shape (n_items, n_users)

a matrix which contains values in {-1, 0, 1} describing the known and unknown user-item matchings

users : array-like of shape (n_user_features, n_users)

an array which contains the user feature vectors

items : array-like of shape (n_item_features, n_items)

an array which contains the item feature vectors
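A sketch of feeding the generated arrays back into a Dataset, assuming the function returns the three arrays in the documented order (depending on the version, the return value may instead be a dictionary of keyword arguments for Dataset):

    from stanscofi.datasets import Dataset, generate_dummy_dataset

    # Assumption: three arrays returned as documented above
    ratings, users, items = generate_dummy_dataset(
        npositive=200, nnegative=100, nfeatures=50, mean=0.5, std=1.0
    )
    dummy = Dataset(ratings=ratings, users=users, items=items, name="dummy")
    dummy.summary()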

stanscofi.models module

class stanscofi.models.BasicModel(params)

Bases: object

A class used to encode a drug repurposing model

Parameters

params : dict

dictionary which contains method-wise parameters

Attributes

name : str

the name of the model

model : depends on the implemented method

may contain an instance of a class of sklearn classifiers

other attributes might be present depending on the type of model

Methods

__init__(params)

Initializes the model with preselected parameters

fit(train_dataset, seed=1234)

Preprocesses and fits the model

predict_proba(test_dataset)

Outputs properly formatted predictions of the fitted model on test_dataset

predict(scores)

Applies the following decision rule: if score<threshold, then return the negative label, otherwise return the positive label

recommend_k_pairs(dataset, k=1, threshold=None)

Outputs the top-k (item, user) candidates (or the candidates whose score is higher than a threshold) in the input dataset

print_scores(scores)

Prints out information about scores

print_classification(predictions)

Prints out information about predicted labels

preprocessing(train_dataset) [not implemented in BasicModel]

Preprocesses the input dataset into an appropriate input to self.model_fit

model_fit(train_dataset) [not implemented in BasicModel]

Fits the model on train_dataset

model_predict_proba(test_dataset) [not implemented in BasicModel]

Outputs predictions of the fitted model on test_dataset
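New methods are added by subclassing BasicModel and implementing the three hooks above. A minimal sketch, assuming fit and predict_proba forward the output of preprocessing to model_fit and model_predict_proba (as the “args” return convention below suggests); the constant scorer is a placeholder, not a real classifier:

    import numpy as np
    from stanscofi.models import BasicModel

    class ConstantModel(BasicModel):
        """Toy model assigning the same score to every (item, user) pair."""
        def __init__(self, params):
            super().__init__(params)
            self.name = "ConstantModel"

        def preprocessing(self, dataset, is_training=True):
            # Forward only the shape of the ratings matrix
            return [dataset.ratings.shape]

        def model_fit(self, shape):
            self.shape = shape              # nothing to learn

        def model_predict_proba(self, shape):
            return np.full(shape, 0.5)      # same score for every pair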

fit(train_dataset, seed=1234)

Fitting the model on the training dataset.

Not implemented in the BasicModel class.

Parameters

train_dataset : stanscofi.Dataset

training dataset on which the model should fit

seed : int (default: 1234)

random seed

model_fit()

Fitting the model on the training dataset.

Not implemented in the BasicModel class.

Parameters

appropriate inputs to the classifier (vary across algorithms)

model_predict_proba()

Making predictions using the model on the testing dataset.

Not implemented in the BasicModel class.

Parameters

appropriate inputs to the classifier (vary across algorithms)

Returns

scores : array-like of shape (n_items, n_users)

prediction values by the model

predict(scores, threshold=0.5)

Outputs class labels based on the scores, using the following formula:

prediction = -1 if (score < threshold) else 1

Parameters

scores : COO-array of shape (n_items, n_users)

sparse matrix in COOrdinate format

threshold : float

the threshold of classification into the positive class

Returns

predictions : COO-array of shape (n_items, n_users)

sparse matrix in COOrdinate format with values in {-1,1}

predict_proba(test_dataset, default_zero_val=1e-31)

Outputs properly formatted scores (not necessarily in [0,1]!) from the fitted model on test_dataset. Internally calls model_predict_proba(), then reformats the scores

Parameters

test_dataset : stanscofi.Dataset

dataset on which predictions should be made

Returns

scores : COO-array of shape (n_items, n_users)

sparse matrix in COOrdinate format, with nonzero values corresponding to predictions on available pairs in the dataset

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the classifier (e.g., X feature matrix for each (user, item) pair, y response vector).

Not implemented in the BasicModel class.

Parameters

dataset : stanscofi.Dataset

dataset to convert

is_training : bool

is the preprocessing prior to training (true) or testing (false)?

Returns

appropriate inputs to the classifier (vary across algorithms)

print_classification(predictions)

Prints out information about the predicted classes

Parameters

predictions : COO-array

sparse matrix in COOrdinate format

print_scores(scores)

Prints out information about the scores

Parameters

scores : COO-array

sparse matrix in COOrdinate format

recommend_k_pairs(dataset, k=1, threshold=None)

Outputs the top-k (item, user) candidates (or the candidates whose score is higher than a threshold) in the input dataset

Parameters

dataset : stanscofi.Dataset

dataset on which predictions should be made

k : int or None (default: 1)

number of pair candidates to return (with ties)

threshold : float or None (default: None)

threshold on candidate scores. If k is not None, the k best candidates are returned independently of the value of threshold

Returns

candidates : list of tuples of size 3

list of (item, user, score) candidates (by name as present in the dataset)

class stanscofi.models.LogisticRegression(params)

Bases: BasicModel

Logistic Regression (calls sklearn.linear_model.LogisticRegression internally). It uses the very same parameters as sklearn.linear_model.LogisticRegression, so please refer to help(sklearn.linear_model.LogisticRegression).

Parameters

params : dict

dictionary which contains sklearn.linear_model.LogisticRegression parameters, plus a key called “preprocessing” which determines which preprocessing function (in stanscofi.preprocessing) should be applied to the data, plus a key called “subset” which gives the maximum number of features to consider in the model (those features will be the Top-subset in terms of variance across samples)

Attributes

Same as BasicModel class

Methods

Same as the BasicModel class, plus:

preprocessing(train_dataset)

Preprocesses the input dataset into something that is an input to fit

model_fit(train_dataset)

Preprocesses and fits the model

model_predict_proba(test_dataset)

Outputs predictions of the fitted model on test_dataset

model_fit(X, y)

Fitting the Logistic Regression model on the training dataset.

Parameters

X : array-like of shape (n_ratings, n_pair_features)

(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)

y : array-like of shape (n_ratings, )

response vector for each (user, item) pair

model_predict_proba(X)

Making predictions using the Logistic Regression model on the testing dataset.

Parameters

X : array-like of shape (n_ratings, n_pair_features)

(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the Logistic Regression classifier.

Parameters

dataset : stanscofi.Dataset

dataset to convert

is_training : bool

is the preprocessing prior to training (true) or testing (false)?

Returns

args : contains

X : array-like of shape (n_ratings, n_pair_features)

(user, item) feature matrix (the actual contents of the matrix depend on the parameters “preprocessing” and “subset” given as input)

y : array-like of shape (n_ratings, )

response vector for each (user, item) pair
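End-to-end usage of this wrapper, as a sketch; the parameter values are hypothetical, but “preprocessing” and “subset” are the two extra keys documented above, and the remaining keys are forwarded to sklearn.linear_model.LogisticRegression:

    from stanscofi.models import LogisticRegression

    # Hypothetical parameter choice
    params = {"preprocessing": "meanimputation_standardize", "subset": 100,
              "max_iter": 1000, "random_state": 1234}
    model = LogisticRegression(params)
    model.fit(train_dataset, seed=1234)          # train_dataset: stanscofi.Dataset

    scores = model.predict_proba(test_dataset)   # COO-array of scores
    predictions = model.predict(scores, threshold=0.5)
    model.print_scores(scores)
    model.print_classification(predictions)
    candidates = model.recommend_k_pairs(test_dataset, k=5)  # (item, user, score)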

class stanscofi.models.NMF(params)

Bases: BasicModel

Non-negative Matrix Factorization (calls sklearn.decomposition.NMF internally). It uses the very same parameters as sklearn.decomposition.NMF, so please refer to help(sklearn.decomposition.NMF).

Parameters

params : dict

dictionary which contains sklearn.decomposition.NMF parameters

Attributes

Same as BasicModel class

Methods

Same as the BasicModel class, plus:

preprocessing(train_dataset)

Preprocesses the input dataset into something that is an input to fit

model_fit(train_dataset)

Preprocesses and fits the model

model_predict_proba(test_dataset)

Outputs predictions of the fitted model on test_dataset

model_fit(input)

Fitting the NMF model on the preprocessed training dataset.

Parameters

input : array-like of shape (n_samples, n_features)

training data

model_predict_proba(input)

Making predictions using the NMF model on the testing dataset.

Parameters

input : array-like of shape (n_samples, n_features)

testing data

Returns

result : array-like of shape (n_samples, n_features)

preprocessing(dataset, is_training=True)

Preprocessing step, which converts elements of a dataset (ratings matrix, user feature matrix, item feature matrix) into appropriate inputs to the NMF classifier.

Parameters

dataset : stanscofi.Dataset

dataset to convert

is_training : bool

is the preprocessing prior to training (true) or testing (false)?

Returns

args : contains

A : array-like of shape (n_users, n_items)

the transposed association matrix, translated so that all its values are non-negative

stanscofi.preprocessing module

class stanscofi.preprocessing.CustomScaler(posinf, neginf)

Bases: object

A class used to encode a simple preprocessing pipeline for feature matrices. Does mean imputation for features, feature filtering, correction of infinity errors and standardization

Parameters

posinf : int

Value to replace infinity (positive) values

neginf : int

Value to replace infinity (negative) values

Attributes

imputer : None or sklearn.impute.SimpleImputer instance

Class for imputation of values

scaler : None or sklearn.preprocessing.StandardScaler

Class for standardization of values

filter : None or list

List of selected features (Top-N in terms of variance)

Methods

__init__(posinf, neginf)

Initialize the scaler (with unfitted attributes)

fit_transform(mat, subset=None, verbose=False)

Fits classes and transforms a matrix

fit_transform(mat, subset=None, verbose=False)

Fits each attribute of the scaler and transforms a feature matrix. Does mean imputation for features, feature filtering, correction of infinity errors and standardization

Parameters

mat : array-like of shape (n_samples, n_features)

matrix which should be preprocessed

subset : None or int

number of features to keep in the feature matrix (Top-N in variance); if the attribute filter is None, it is initialized, otherwise it is used to filter features

verbose : bool

prints out information

Returns

mat_nan : array-like of shape (n_samples, n_features)

Preprocessed matrix
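A sketch of typical usage (the keyword names posinf/neginf follow the constructor signature above; the toy matrix is made up):

    import numpy as np
    from stanscofi.preprocessing import CustomScaler

    mat = np.random.rand(100, 20)
    mat[0, 0] = np.inf       # will be replaced by posinf
    mat[1, 1] = np.nan       # will be mean-imputed

    scaler = CustomScaler(posinf=10, neginf=-10)
    mat_clean = scaler.fit_transform(mat, subset=10, verbose=False)
    print(mat_clean.shape)   # (100, 10): Top-10 features by variance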

stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)

Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Instead of concatenating item and user features for a given pair, resulting in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices), compute a single score per pair of (item_matrix, user_matrix) for each (item, user) pair, resulting in a vector of size (n_item_matrices) x (n_user_matrices).

The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i’) x fu(u’, u)) | (i’,u’)!=(i,u), fi(i, i’)!=NaN, fu(u’, u)!=NaN, rating(i’,u’)!=0 }

Then the final feature matrix is X = (X_{i,j})_{i,j} for i a (item, user) pair and j a (item similarity, user similarity) pair

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features), with the following attributes:

ratings : COO-array of shape (n_items, n_users)

an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings

items : COO-array of shape (n_item_features, n_items)

concatenation of n_item_matrices drug similarity matrices of shape (n_drugs, n_drugs), where values in item_features are denoted by “<feature><sep_feature><drug>” and missing values are denoted by numpy.nan; if the prefix “<feature><sep_feature>” is missing, it is assumed that items is a single similarity matrix (n_item_matrices=1)

users : COO-array of shape (n_user_features, n_users)

concatenation of n_user_matrices disease similarity matrices of shape (n_diseases, n_diseases), where values in user_features are denoted by “<feature><sep_feature><disease>” and missing values are denoted by numpy.nan; if the prefix “<feature><sep_feature>” is missing, it is assumed that users is a single similarity matrix (n_user_matrices=1)

NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).

njobs : int

number of jobs to run in parallel

sep_feature : str or None

separator between feature type and element in the feature matrices in the dataset. None if one single feature type is expected

missing : int

placeholder value that should be different from any feature name

inf : int

Value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)

verbose : bool

prints out information

Returns

X : array-like of shape (n_items x n_users, n_item_matrices x n_user_matrices)

the feature matrix

y : array-like of shape (n_items x n_users, )

the response/outcome vector
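A sketch of a call on a dataset whose item and user matrices are concatenated similarity matrices, as described above (the separator value is an assumption):

    from stanscofi.preprocessing import Perlman_procedure

    # dataset: a stanscofi.Dataset with similarity-matrix features
    X, y = Perlman_procedure(dataset, njobs=4, sep_feature="-", verbose=False)
    print(X.shape)  # one column per (item similarity, user similarity) pair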

stanscofi.preprocessing.cartesian_product_transpose(*arrays)

stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)

Computes a single feature matrix and response vector from a drug repurposing dataset, by imputing missing values with the average value of each feature, and by centering and standardizing the user and item feature matrices before concatenating them

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features), where missing values are denoted by numpy.nan

subset : None or int

number of features to keep in the item feature matrix, and in the user feature matrix (selecting the ones with highest variance)

scalerS : None or sklearn.preprocessing.StandardScaler instance

scaler for items

scalerP : None or sklearn.preprocessing.StandardScaler instance

scaler for users

inf : float or int

placeholder value for infinity values (positive: +inf, negative: -inf)

verbose : bool

prints out information

Returns

X : array-like of shape (n_folds, n_item_features+n_user_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items; if the input value was None, returns the scaler fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
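A sketch of the fit/reuse pattern suggested by the scaler parameters above: fit the scalers on the training dataset, then pass them back to transform the testing dataset consistently (train and test are assumed to be stanscofi Datasets, e.g., from a train/test split):

    from stanscofi.preprocessing import meanimputation_standardize

    # First call fits the scalers; passing them back reuses the same fit
    X_train, y_train, scalerS, scalerP = meanimputation_standardize(
        train, subset=50
    )
    X_test, y_test, _, _ = meanimputation_standardize(
        test, subset=50, scalerS=scalerS, scalerP=scalerP
    )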

stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)

Preprocesses a dataset into a feature matrix X and a response vector y, by applying one of the preprocessing procedures implemented in this module

Parameters

dataset : stanscofi.datasets.Dataset

dataset to preprocess

preprocessing_str : str

type of preprocessing: in [“Perlman_procedure”, “meanimputation_standardize”, “same_feature_preprocessing”]

subset_ : None or int

number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance); note that the selection is performed across user and item features if preprocessing_str!=”meanimputation_standardize”, otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)

operator : None or str

arithmetic operation to apply, e.g., “+”, “*”

sep_feature : str

separator between feature type and element in the feature matrices in the dataset

filter_ : None or list

list of feature indices to keep (of length subset_); overrides the subset_ parameter if both are provided

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items; the scaler fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users; the scaler fitted on user feature vectors

inf : float or int

placeholder value for infinity values (positive: +inf, negative: -inf)

njobs : int

number of jobs to run in parallel (njobs > 0) for the Perlman procedure

Returns

X : array-like of shape (n_folds, n_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items; if the input value was None, returns the scaler fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users; if the input value was None, returns the scaler fitted on user feature vectors

filter_ : None or list

list of feature indices to keep (of length subset_)

stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)

If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors coefficient per coefficient.

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True

operator : str

arithmetic operation to apply, e.g., “+”, “*”

Returns

X : array-like of shape (n_folds, n_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector

stanscofi.training_testing module

stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Trains a model on a dataset using cross-validation and custom metrics using sklearn.model_selection.StratifiedKFold

Parameters

template : stanscofi.models.BasicModel or subclass

type of model to train

params : dict

dictionary of parameters to initialize the model

train_dataset : stanscofi.Dataset

dataset to train upon

nsplits : int

number of cross-validation steps

metric : str

metric to optimize the model upon. Implemented metrics are in validation.py

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

percentage of testing set (if cv_type=”weakly_correlated”)

dist_type : str (default: “cosine”)

type of metric for splitting (if cv_type=”weakly_correlated”)

cv_type : str (default: “random”)

type of split to apply to the dataset. Can either be “random” or “weakly_correlated”

early_stop : int or None

positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information

Returns

results : dict

a dictionary which contains:

“models” : list of subinstances of stanscofi.models.BasicModel of length nsplits

all trained models

“train_metric” : list of floats of length nsplits

all metrics on training sets

“test_metric” : list of floats of length nsplits

all metrics on testing sets

“cv_folds” : list of COO-arrays of shape (n_items, n_users) of length nsplits

the training and testing folds for each split
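A sketch of a cross-validated training run; the model parameters are hypothetical, and “AUC” is assumed to be a valid metric name since an AUC function is listed in the validation module below:

    import numpy as np
    from stanscofi.models import LogisticRegression
    from stanscofi.training_testing import cv_training

    params = {"preprocessing": "meanimputation_standardize", "subset": 100}
    results = cv_training(LogisticRegression, params, train_dataset,
                          nsplits=5, metric="AUC", threshold=0,
                          cv_type="random", njobs=1, random_state=1234)
    # Retrieve the model with the best test metric across splits
    best = results["models"][int(np.argmax(results["test_metric"]))]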

stanscofi.training_testing.grid_search(search_params, template, params, train_dataset, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', njobs=1, random_state=1234, show_plots=False, verbose=False)

Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.

Parameters

search_params : dict

a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search

template : stanscofi.models.BasicModel or subclass

type of model to train

params : dict

dictionary of parameters to initialize the model

train_dataset : stanscofi.Dataset

dataset to train upon

metric : str

metric to optimize the model upon. Implemented metrics are in validation.py

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

percentage of testing set (if cv_type=”weakly_correlated”)

dist_type : str (default: “cosine”)

type of metric for splitting (if cv_type=”weakly_correlated”)

cv_type : str (default: “random”)

type of split to apply to the dataset. Can either be “random” or “weakly_correlated”

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information

Returns

best_params : dict

a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps

best_model : subinstance of stanscofi.models.BasicModel

the best trained model associated with the best parameters

metrics : dict

a dictionary which contains:

“train_metric” : float

the metric on the training set of the best cross-validation split for the best set of parameters

“test_metric” : float

the metric on the testing set of the best cross-validation split for the best set of parameters

stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)

Converts indices of datapoints into folds as defined in stanscofi

Parameters

indices : array-like of size (n_selected_ratings, )

flat indices of selected datapoints

indices_array : array-like of size (n_total_ratings, 2)

corresponding row and column indices of datapoints

shape : tuple of integers of size 2

total numbers of rows and columns

Returns

folds : COO-array of shape shape

folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)

stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')

Splits the data into training and testing datasets randomly for cross-validation.

Parameters

dataset : stanscofi.Dataset

dataset to split

cv_generator : scikit-learn cross-validation index generator

e.g. StratifiedKFold, KFold

metric : str

metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

Returns

cv_folds : list of size nsplits of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_lst : list of size nsplits of tuples of floats of size 3

for each fold, the minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set

stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)

Splits the data into training and testing datasets randomly.

Parameters

dataset : stanscofi.Dataset

dataset to split

test_size : float

value between 0 and 1 (strictly) which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset

metric : str

metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int

random seed

Returns

cv_folds : list of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
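A sketch combining the split with Dataset.subset to materialize the training and testing datasets; the tuple is indexed defensively, since the folds are documented as the first returned value:

    from stanscofi.training_testing import random_simple_split

    out = random_simple_split(dataset, test_size=0.2, random_state=1234)
    train_folds, test_folds = out[0]   # cv_folds: [train mask, test mask]
    train = dataset.subset(train_folds, subset_name="train")
    test = dataset.subset(test_folds, subset_name="test")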

stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop=None, metric='cosine', random_state=1234, niter=100, verbose=False)

Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.

Parameters

dataset : stanscofi.Dataset

dataset to split

test_size : float

value between 0 and 1 (strictly) which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset

early_stop : int or None

positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, then the property on test_size will not necessarily hold anymore

metric : str

metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int

random seed

niter : int

maximum number of iterations of the clustering loop

verbose : bool

prints out information

Returns

train_folds, test_folds : COO-arrays of shape (n_items, n_users)

arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set

stanscofi.utils module

stanscofi.utils.compute_sparsity(df)

Computes the sparsity number of a collaborative filtering dataset

Parameters

df : pandas.DataFrame of shape (n_items, n_users)

the matrix of ratings where unknown matchings are denoted with 0

Returns

sparsity : float

the percentage of non-missing values in the matrix of ratings

stanscofi.utils.load_dataset(model_name, save_folder='./', sep_feature='-')

Loads a drug repurposing dataset

Parameters

model_name : str

the name of the dataset to load. Should belong to the following list: [“Gottlieb”, “DNdataset”, “Cdataset”, “LRSSL”, “PREDICT_Gottlieb”, “TRANSCRIPT”, “PREDICT”]

save_folder : str

the path to the folder where dataset-related files are or will be stored

Returns

dataset_di : dictionary

a dictionary where key “ratings” contains the drug-disease matching pandas.DataFrame of shape (n_drugs, n_diseases) (where missing values are denoted by 0), key “users” corresponds to the disease feature pandas.DataFrame of shape (n_disease_features, n_diseases), and key “items” corresponds to the drug feature pandas.DataFrame of shape (n_drug_features, n_drugs)
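A sketch of loading a public dataset and wrapping it into a Dataset; the keyword unpacking assumes the dictionary keys match the Dataset constructor arguments, as their names suggest:

    from stanscofi.datasets import Dataset
    from stanscofi.utils import load_dataset

    # Downloads/loads the files into save_folder on first call
    data_args = load_dataset("TRANSCRIPT", save_folder="./datasets/")
    dataset = Dataset(**data_args)   # keys: "ratings", "users", "items"
    dataset.summary()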

stanscofi.utils.matrix2ratings(df, user_col='user', item_col='item', rating_col='rating')

Converts a matrix into a list of ratings

Parameters

df : pandas.DataFrame of shape (n_items, n_users)

the matrix of ratings in {-1, 0, 1} where unknown matchings are denoted with 0

user_col : str

column denoting users

item_col : str

column denoting items

rating_col : str

column denoting ratings in {-1, 0, 1}

Returns

ratings : pandas.DataFrame of shape (n_ratings, 3)

the list of known ratings where the first column corresponds to users, the second to items, and the third to ratings

stanscofi.utils.merge_ratings(rating_dfs, user_col, item_col, rating_col)

Merges rating lists from several sources by solving conflicts. Conflicting ratings are resolved as follows: if there is at least one negative rating (-1) reported for a (drug, disease) pair, then the final rating is negative (-1); if there is at least one positive rating (1) and no negative rating (-1) reported, then the final rating is positive (1)

Parameters

rating_dfs : list of pandas.DataFrame of shape (n_ratings, 3)

the list of rating lists where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}

user_col : str

column denoting users

item_col : str

column denoting items

rating_col : str

column denoting ratings in {-1, 0, 1}

verbose : bool

prints out information

Returns

rating_df : pandas.DataFrame of shape (n_ratings, 3)

the merged rating list where one column (of name user_col) is associated with users, one column (of name item_col) is associated with items, and one column (of name rating_col) is associated with ratings in {-1, 0, 1}

stanscofi.utils.print_dataset(ratings, user_col, item_col, rating_col)

Prints values of a drug repurposing dataset

Parameters

ratings : pandas.DataFrame of shape (n_ratings, 3)

the list of ratings with columns user_col, item_col, rating_col

user_col : str

column denoting users

item_col : str

column denoting items

rating_col : str

column denoting ratings in {-1, 0, 1}

Returns

None

Prints

The number of items/drugs, users/diseases, and the number of positive (1), negative (-1) and unknown (0) matchings.

stanscofi.utils.ratings2matrix(ratings, user_col, item_col, rating_col)

Converts a list of ratings into a matrix

Parameters

ratings : pandas.DataFrame of shape (n_ratings, 3)

the list of known ratings where the first column (user_col) corresponds to users, the second (item_col) to items, and the third (rating_col) to ratings in {-1, 0, 1}

user_col : str

column denoting users

item_col : str

column denoting items

rating_col : str

column denoting ratings in {-1, 0, 1}

Returns

df : pandas.DataFrame of shape (n_items, n_users)

the matrix of ratings in {-1, 0, 1} where unknown matchings are denoted with 0
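The two converters are inverses on the known ratings; a toy round trip (all names made up):

    import pandas as pd
    from stanscofi.utils import matrix2ratings, ratings2matrix

    df = pd.DataFrame([[1, 0], [0, -1]],
                      index=["drug_a", "drug_b"],
                      columns=["disease_x", "disease_y"])
    ratings = matrix2ratings(df, user_col="user", item_col="item",
                             rating_col="rating")   # 2 known ratings
    df_back = ratings2matrix(ratings, user_col="user", item_col="item",
                             rating_col="rating")   # unknowns back to 0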

stanscofi.validation module

stanscofi.validation.AP(y_true, y_pred, u, u1)

stanscofi.validation.AUC(y_true, y_pred, k, u1)

stanscofi.validation.DCGk(y_true, y_pred, k, u1)

stanscofi.validation.ERR(y_true, y_pred, max=10, max_grade=2)

source: https://raw.githubusercontent.com/skondo/evaluation_measures/master/evaluations_measures.py

stanscofi.validation.F1K(y_true, y_pred, k, u1)

stanscofi.validation.Fscore(y_true, y_pred, u, beta)

stanscofi.validation.HRk(y_true, y_pred, k, u1)

stanscofi.validation.MAP(y_true, y_pred, u, u1)

stanscofi.validation.MRR(y_true, y_pred, u, u1)

stanscofi.validation.MeanRank(y_true, y_pred, k, u1)

stanscofi.validation.NDCGk(y_true, y_pred, k, u1)

stanscofi.validation.PrecisionK(y_true, y_pred, k, u1)

stanscofi.validation.RP(y_true, y_pred, u, u1)

stanscofi.validation.RecallK(y_true, y_pred, k, u1)

stanscofi.validation.Rscore(y_true, y_pred, u, u1)

stanscofi.validation.TAU(y_true, y_pred, u, u1)

stanscofi.validation.compute_metrics(scores, predictions, dataset, metrics, k=1, beta=1, verbose=False)

Computes user-wise validation metrics for a given set of scores and predictions w.r.t. a dataset

Parameters

scores : COO-array of shape (n_items, n_users)

sparse matrix in COOrdinate format

predictions : COO-array of shape (n_items, n_users)

sparse matrix in COOrdinate format with values in {-1, 1}

dataset : stanscofi.Dataset

dataset on which the metrics should be computed

metrics : list of str

list of metrics which should be computed

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

verbose : bool

prints out information about users ignored in the computation of validation metrics, that is, users whose pairs are associated with only a single class (i.e., all pairs with that user are assigned 0, -1 or 1)

Returns

metrics : pandas.DataFrame of shape (len(metrics), 2)

table of metrics: metrics in rows, average and standard deviation across users in columns

plots_args : dict

dictionary of arguments to feed to the plot_metrics function to plot the Precision-Recall and the Receiver Operating Characteristic (ROC) curves
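A sketch of the validation loop on a fitted model; the metric name strings are assumed to match the function names listed above:

    from stanscofi.validation import compute_metrics, plot_metrics

    scores = model.predict_proba(test)          # test: stanscofi.Dataset
    predictions = model.predict(scores, threshold=0.5)
    metrics_df, plots_args = compute_metrics(scores, predictions, test,
                                             metrics=["AUC", "NDCGk"], k=5)
    print(metrics_df)                           # mean and std across users
    # Assumes plots_args does not already set model_name
    plot_metrics(**plots_args, model_name="LogisticRegression")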

stanscofi.validation.plot_metrics(y_true=None, y_pred=None, scores=None, ground_truth=None, predictions=None, aucs=None, fscores=None, tprs=None, recs=None, figsize=(16, 5), model_name='Model')

Plots the ROC curve, the Precision-Recall curve, the boxplot of predicted scores and the piechart of classes associated to the predictions y_pred in input w.r.t. ground truth y_true

Parameters

y_true : array-like of shape (n_ratings, )

an array which contains the binary ground truth labels in {0, 1}

y_pred : array-like of shape (n_ratings, )

an array which contains the binary predicted labels in {0, 1}

scores : array-like of shape (n_ratings, )

an array which contains the predicted scores

ground_truth : array-like of shape (n_ratings, )

an array which contains the ground truth labels in {-1, 0, 1}

predictions : array-like of shape (n_ratings, )

an array which contains the predicted labels in {-1, 0, 1}

aucs : list

list of AUCs per user

fscores : list

list of F-scores per user

tprs : array-like of shape (n_thresholds, )

Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i], where thresholds was given as input to sklearn.metrics.roc_curve

recs : array-like of shape (n_thresholds, )

Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0, where thresholds was given as input to sklearn.metrics.precision_recall_curve

figsize : tuple of size 2

width and height of the figure

model_name : str

model which predicted the ratings

Returns

metrics : pandas.DataFrame of shape (2, 2)

table of metrics: AUC, F_beta score in rows, average and standard deviation across users in columns

plots_args : dict

dictionary of arguments to feed to the plot_metrics function

Module contents