
stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Trains a model on a dataset with cross-validation and custom metrics, relying on sklearn.model_selection.StratifiedKFold.


template : stanscofi.BasicModel or subclass

type of model to train


params : dict

dictionary of parameters to initialize the model


train_dataset

dataset to train upon


nsplits : int

number of cross-validation steps


metric

metric to optimize the model upon. Implemented metrics are in

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

proportion of the data assigned to the testing set (if cv_type=“weakly_correlated”)

dist_type : str (default: “cosine”)

type of metric for splitting (if cv_type=“weakly_correlated”)

cv_type : str (default: “random”)

type of split to apply to the dataset. Can be either “random” or “weakly_correlated”

early_stop : int or None (default: 2)

positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, the constraint on test_size is no longer guaranteed to hold

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information


a dictionary which contains
“models” : list of length nsplits of instances of subclasses of stanscofi.models.BasicModel

all trained models

“train_metric” : list of floats of length nsplits

all metrics on training sets

“test_metric” : list of floats of length nsplits

all metrics on testing sets

“cv_folds” : list of length nsplits of COO-arrays of shape (n_items, n_users)

the training and testing folds for each split
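
The loop described above can be pictured with the following minimal sketch. It does not call stanscofi: the function name `cv_sketch`, the logistic-regression stand-in model, and the AUC metric are assumptions chosen only to show how a StratifiedKFold loop can fill a results dictionary with the “models”, “train_metric” and “test_metric” keys.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cv_sketch(X, y, nsplits=5, random_state=1234):
    """Hypothetical stand-in for a cross-validated training loop."""
    cv = StratifiedKFold(n_splits=nsplits, shuffle=True, random_state=random_state)
    results = {"models": [], "train_metric": [], "test_metric": []}
    for train_idx, test_idx in cv.split(X, y):
        # One model per split, trained on the training part only.
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        results["models"].append(model)
        for key, idx in (("train_metric", train_idx), ("test_metric", test_idx)):
            scores = model.predict_proba(X[idx])[:, 1]
            results[key].append(float(roc_auc_score(y[idx], scores)))
    return results


# Synthetic binary data, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

res = cv_sketch(X, y, nsplits=5)
best_split = int(np.argmax(res["test_metric"]))  # split with the best test metric
```

Each list in the returned dictionary has one entry per split, so the model with the best test metric can be retrieved as `res["models"][best_split]`.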

Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.



a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search

template : stanscofi.BasicModel or subclass

type of model to train


dictionary of parameters to initialize the model


dataset to train upon


metric to optimize the model upon. Implemented metrics are in

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

proportion of the data assigned to the testing set (if cv_type=“weakly_correlated”)

dist_type : str (default: “cosine”)

type of metric for splitting (if cv_type=“weakly_correlated”)

cv_type : str (default: “random”)

type of split to apply to the dataset. Can be either “random” or “weakly_correlated”

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information



a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps

best_model : instance of a subclass of stanscofi.models.BasicModel

the best trained model associated with the best parameters

a dictionary which contains

the metric on the training set for the best cross-validation split, for the best set of parameters


the metric on the testing set for the best cross-validation split, for the best set of parameters
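
To illustrate what “optimizing over one parameter at a time” means, here is a minimal pure-Python sketch. The names `coordinate_grid_search`, `toy_score`, `rank` and `reg` are hypothetical and not part of stanscofi; `score` stands in for a cross-validated metric where higher is better.

```python
def coordinate_grid_search(search_params, params, score):
    """Optimize one hyperparameter at a time, keeping the others fixed.

    search_params maps each hyperparameter name to the values to try;
    params holds the starting values; score is a hypothetical callable
    returning a metric to maximize.
    """
    best_params = dict(params)
    best_score = score(best_params)
    for name, values in search_params.items():
        for value in values:
            candidate = {**best_params, name: value}
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_params, best_score = candidate, candidate_score
    return best_params, best_score


def toy_score(p):
    # Toy objective with its maximum at rank=4, reg=0.1.
    return -((p["rank"] - 4) ** 2) - (p["reg"] - 0.1) ** 2


best_params, best_score = coordinate_grid_search(
    {"rank": [2, 4, 8], "reg": [0.01, 0.1, 1.0]},
    {"rank": 2, "reg": 1.0},
    toy_score,
)
```

With two hyperparameters of three values each, this visits six candidates instead of the nine combinations of an exhaustive grid; the trade-off is that interactions between hyperparameters can be missed.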

stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)

Converts indices of datapoints into folds as defined in stanscofi.


indices : array-like of size (n_selected_ratings, )

flat indices of selected datapoints

indices_array : array-like of size (n_total_ratings, 2)

corresponding row and column indices of datapoints

shape : tuple of integers of size 2

total numbers of rows and columns


folds : COO-array of shape shape

folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
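
The conversion can be sketched as follows, assuming that a fold marks each selected (row, column) pair with 1. The function `indices_to_folds_sketch` is a hypothetical re-implementation written for illustration, not the library's code.

```python
import numpy as np
from scipy.sparse import coo_array


def indices_to_folds_sketch(indices, indices_array, shape):
    """Mark the selected (row, column) pairs with 1 in a sparse COO array."""
    indices_array = np.asarray(indices_array)
    selected = indices_array[np.asarray(indices, dtype=int)]
    rows, cols = selected[:, 0], selected[:, 1]
    data = np.ones(len(selected), dtype=int)
    return coo_array((data, (rows, cols)), shape=shape)


# Five known datapoints in a 3 x 3 matrix; select the 2nd and the 4th.
indices_array = np.array([[0, 0], [0, 2], [1, 1], [2, 0], [2, 2]])
folds = indices_to_folds_sketch([1, 3], indices_array, shape=(3, 3))
dense = folds.toarray()
```

Here `dense` has ones at positions (0, 2) and (2, 0) and zeros elsewhere.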

stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')

Splits the data into training and testing datasets randomly for cross-validation.



dataset

dataset to split

cv_generator : scikit-learn cross-validation index generator

e.g. StratifiedKFold, KFold

metric : str (default: “cosine”)

metric used to assess the distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]


cv_folds : list of size nsplits of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_lst : list of size nsplits of tuples of float of size 3

for each fold, the minimum nonzero distance between an element of the training set and an element of the testing set, then within the training set, then within the testing set
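
As a rough picture of how a scikit-learn index generator can produce such folds, the sketch below splits the known ratings of a toy matrix. The helper `cv_folds_sketch`, the toy matrix, and the choice to return (train, test) pairs are assumptions made for illustration; the library's internal representation may differ.

```python
import numpy as np
from scipy.sparse import coo_array
from sklearn.model_selection import KFold


def cv_folds_sketch(ratings, cv_generator):
    """Return one (train_folds, test_folds) pair of 0/1 masks per split."""
    ratings = coo_array(ratings)
    # One (row, column) pair per known (nonzero) rating.
    pairs = np.column_stack((ratings.row, ratings.col))

    def to_mask(idx):
        data = np.ones(len(idx), dtype=int)
        return coo_array((data, (pairs[idx, 0], pairs[idx, 1])), shape=ratings.shape)

    # KFold only needs the number of samples; StratifiedKFold would also
    # need the rating values as labels.
    return [(to_mask(tr), to_mask(te)) for tr, te in cv_generator.split(pairs)]


# Toy ratings matrix (items x users): 1 = positive, -1 = negative, 0 = unknown.
ratings = np.array([[1, 0, -1], [0, 1, 0], [-1, 0, 1], [1, -1, 0]])
cv = KFold(n_splits=3, shuffle=True, random_state=1234)
cv_folds = cv_folds_sketch(ratings, cv)
```

In every split, the training and testing masks are disjoint and together cover exactly the known ratings.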

stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)

Splits the data into training and testing datasets randomly.



dataset

dataset to split

test_size : float

value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset

metric : str (default: “cosine”)

metric used to assess the distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int (default: 1234)

random seed


cv_folds : list of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element of the training set and an element of the testing set, then within the training set, then within the testing set
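
The three returned distances can be understood through the sketch below, which uses scipy.spatial.distance.cdist on small feature matrices. The helper `min_nonzero_distance` and its `tol` argument are hypothetical; the example uses the Euclidean metric so that the values are easy to check by hand.

```python
import numpy as np
from scipy.spatial.distance import cdist


def min_nonzero_distance(A, B, metric="cosine", tol=1e-12):
    """Smallest pairwise distance above tol between rows of A and rows of B."""
    D = cdist(A, B, metric=metric)
    positive = D[D > tol]  # drops self-distances and exact duplicates
    return float(positive.min()) if positive.size else float("nan")


train = np.array([[0.0, 0.0], [3.0, 4.0]])
test = np.array([[0.0, 1.0], [6.0, 8.0]])

dist_train_test = min_nonzero_distance(train, test, metric="euclidean")
dist_train = min_nonzero_distance(train, train, metric="euclidean")
dist_test = min_nonzero_distance(test, test, metric="euclidean")
```

A small `dist_train_test` relative to `dist_train` and `dist_test` suggests that some testing elements sit very close to training elements, which is what a weakly correlated split tries to avoid.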

stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop=None, metric='cosine', random_state=1234, niter=100, verbose=False)

Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.



dataset

dataset to split

test_size : float

value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset

early_stop : int or None (default: None)

positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, the constraint on test_size is no longer guaranteed to hold

metric : str (default: “cosine”)

metric used to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int (default: 1234)

random seed

niter : int (default: 100)

maximum number of iterations of the clustering loop

verbose : bool (default: False)

prints out information


train_folds, test_folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element of the training set and an element of the testing set, then within the training set, then within the testing set
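
The idea of a cluster-based split can be sketched as follows, under simplifying assumptions: the function `cluster_split_sketch` is hypothetical, it takes a fixed number of clusters `nclusters` instead of searching for one (which is what early_stop and niter control in the documented function), it uses average linkage, and it splits items rather than individual ratings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_split_sketch(item_features, test_size, nclusters, metric="cosine"):
    """Assign whole item clusters to the test side, up to test_size of the items.

    item_features has one row per item; NaNs are replaced with 0.
    """
    X = np.nan_to_num(np.asarray(item_features, dtype=float))
    Z = linkage(X, method="average", metric=metric)
    labels = fcluster(Z, t=nclusters, criterion="maxclust")
    budget = test_size * len(X)
    test_items = []
    # Visit clusters from smallest to largest and add each one that still fits.
    for label in sorted(set(labels), key=lambda c: int((labels == c).sum())):
        members = np.flatnonzero(labels == label)
        if len(test_items) + len(members) <= budget:
            test_items.extend(members.tolist())
    test_items = np.array(sorted(test_items), dtype=int)
    train_items = np.setdiff1d(np.arange(len(X)), test_items)
    return train_items, test_items


# Ten items in three well-separated directions (one feature is missing).
features = np.array([
    [1, np.nan], [1, 0.05], [1, 0.1], [1, -0.05], [1, -0.1], [1, 0.02],
    [0.05, 1], [-0.05, 1],
    [-1, -1], [-1, -0.9],
])
train_items, test_items = cluster_split_sketch(features, test_size=0.2, nclusters=3)
```

Because clusters are moved as a whole, items that end up in the test side have no close neighbour in the training side, at the cost of a test set that may be smaller than the requested proportion.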