Training/Testing

stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Trains a model on a dataset using cross-validation (via sklearn.model_selection.StratifiedKFold) and custom metrics.

Parameters

template : stanscofi.BasicModel or subclass

type of model to train

params : dict

dictionary of parameters to initialize the model

train_dataset : stanscofi.Dataset

dataset to train upon

nsplits : int

number of cross-validation steps

metric : str

metric to optimize the model upon. Implemented metrics are in validation.py

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

proportion of the data assigned to the testing set (if cv_type="weakly_correlated")

dist_type : str (default: "cosine")

type of metric for splitting (if cv_type="weakly_correlated")

cv_type : str (default: "random")

type of split to apply to the dataset. Can be either "random" or "weakly_correlated"

early_stop : int or None (default: 2)

positive integer, which stops the search for the number of clusters after 3 tries yielding the same number; note that if early_stop is not None, the property on test_size may no longer hold

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information

Returns

results : dict
a dictionary which contains the following keys:
"models" : list of subinstances of stanscofi.models.BasicModel of length nsplits

all trained models

"train_metric" : list of floats of length nsplits

all metrics on training sets

"test_metric" : list of floats of length nsplits

all metrics on testing sets

"cv_folds" : list of COO-arrays of shape (n_items, n_users) of length nsplits

the training and testing folds for each split
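
As a rough illustration of how a results dictionary of this shape can be assembled, here is a minimal sketch of a stratified cross-validation loop. It does not use stanscofi: the scikit-learn LogisticRegression model, the AUC metric, the toy data and the name cv_sketch are all assumptions made for the example, and the folds are stored as index pairs rather than COO-arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cv_sketch(X, y, nsplits=5, random_state=1234):
    """Hypothetical helper: one model and one train/test metric per split."""
    cv = StratifiedKFold(n_splits=nsplits, shuffle=True, random_state=random_state)
    results = {"models": [], "train_metric": [], "test_metric": [], "cv_folds": []}
    for train_idx, test_idx in cv.split(X, y):
        # Stand-in model; stanscofi would instantiate template(params) instead.
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        train_scores = model.predict_proba(X[train_idx])[:, 1]
        test_scores = model.predict_proba(X[test_idx])[:, 1]
        results["models"].append(model)
        results["train_metric"].append(roc_auc_score(y[train_idx], train_scores))
        results["test_metric"].append(roc_auc_score(y[test_idx], test_scores))
        # Index pairs here; the documented function returns COO-arrays.
        results["cv_folds"].append((train_idx, test_idx))
    return results


# Toy binary classification data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
results = cv_sketch(X, y, nsplits=5)
```

Each list in results has one entry per split, so the best split can be picked with, for instance, np.argmax(results["test_metric"]).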

Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.

Parameters

search_params : dict

a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search

template : stanscofi.BasicModel or subclass

type of model to train

params : dict

dictionary of parameters to initialize the model

train_dataset : stanscofi.Dataset

dataset to train upon

metric : str

metric to optimize the model upon. Implemented metrics are in validation.py

k : int (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

beta : float (default: 1)

Argument of the metric to optimize. Implemented metrics are in validation.py

threshold : float (default: 0)

decision threshold

test_size : float (default: 0.2)

proportion of the data assigned to the testing set (if cv_type="weakly_correlated")

dist_type : str (default: "cosine")

type of metric for splitting (if cv_type="weakly_correlated")

cv_type : str (default: "random")

type of split to apply to the dataset. Can be either "random" or "weakly_correlated"

njobs : int (default: 1)

number of jobs to run in parallel. Should be lower than nsplits-1

random_state : int (default: 1234)

random seed

show_plots : bool (default: False)

shows the validation plots at each cross-validation step

verbose : bool (default: False)

prints out information

Returns

best_params : dict

a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps

best_model : subinstance of stanscofi.models.BasicModel

the best trained model associated with the best parameters

metrics : dict
a dictionary which contains the following keys:
"train_metric" : float

the metric on the training set on the best cross-validation split for the best set of parameters

"test_metric" : float

the metric on the testing set on the best cross-validation split for the best set of parameters
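
The "one parameter at a time" strategy described for this grid search can be sketched as a coordinate-wise search. This is a minimal illustration under stated assumptions, not the stanscofi implementation: coordinate_grid_search, score_fn and toy_score are hypothetical names, and score_fn stands in for the place where a cross-validated metric (e.g. from cv_training) would be computed.

```python
def coordinate_grid_search(search_params, params, score_fn):
    """Hypothetical helper: optimize each hyperparameter in turn,
    keeping the best value found so far for all the others."""
    best_params = dict(params)
    best_score = score_fn(best_params)
    for name, candidates in search_params.items():
        for value in candidates:
            trial = {**best_params, name: value}
            score = score_fn(trial)  # stand-in for a cross-validated test metric
            if score > best_score:
                best_params, best_score = trial, score
    return best_params, best_score


def toy_score(p):
    # Toy objective (assumption): maximal at rank=3, reg=0.1.
    return -((p["rank"] - 3) ** 2) - (p["reg"] - 0.1) ** 2


best_params, best_score = coordinate_grid_search(
    {"rank": [1, 2, 3, 4], "reg": [0.01, 0.1, 1.0]},
    {"rank": 1, "reg": 1.0},
    toy_score,
)
print(best_params)  # {'rank': 3, 'reg': 0.1}
```

Compared with an exhaustive grid, this explores the sum rather than the product of the candidate counts, at the cost of possibly missing interactions between hyperparameters.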

stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)

Converts indices of datapoints into folds as defined in stanscofi

Parameters

indices : array-like of size (n_selected_ratings, )

flat indices of selected datapoints

indices_array : array-like of size (n_total_ratings, 2)

corresponding row and column indices of datapoints

shape : tuple of integers of size 2

total numbers of rows and columns

Returns

folds : COO-array of shape shape

folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
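
A minimal sketch of this conversion, assuming that each entry of indices selects one row of indices_array and that selected datapoints are marked with 1 in the resulting array. The helper name indices_to_folds_sketch is hypothetical, and scipy.sparse.coo_array requires SciPy 1.8 or later.

```python
import numpy as np
from scipy.sparse import coo_array


def indices_to_folds_sketch(indices, indices_array, shape):
    """Hypothetical re-implementation: flat indices -> sparse {0, 1} fold array."""
    indices_array = np.asarray(indices_array)
    rows = indices_array[indices, 0]  # row index of each selected datapoint
    cols = indices_array[indices, 1]  # column index of each selected datapoint
    data = np.ones(len(rows), dtype=int)
    return coo_array((data, (rows, cols)), shape=shape)


# Four known ratings in a 3 x 3 matrix, given as (row, column) pairs.
indices_array = np.array([[0, 0], [0, 2], [1, 1], [2, 0]])
folds = indices_to_folds_sketch([1, 3], indices_array, shape=(3, 3))
dense = folds.toarray()
# dense is [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
```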

stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')

Splits the data into training and testing datasets randomly for cross-validation.

Parameters

dataset : stanscofi.Dataset

dataset to split

cv_generator : scikit-learn cross-validation index generator

e.g. StratifiedKFold, KFold

metric : str

metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

Returns

cv_folds : list of size nsplits of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_lst : list of size nsplits of tuples of 3 floats

for each fold, minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
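
To make the role of cv_generator concrete, the sketch below runs a scikit-learn KFold over the known ratings of a toy matrix and builds one pair of {0, 1} fold arrays per split. The toy ratings matrix, the convention that 0 means "unknown", and the choice to store each split as a (train, test) pair are assumptions for illustration; the distance computation is left out.

```python
import numpy as np
from scipy.sparse import coo_array
from sklearn.model_selection import KFold

# Toy ratings of shape (n_items, n_users); 0 marks an unknown rating (assumption).
ratings = np.array([[1, 0, -1], [0, 1, 1], [-1, 0, 1], [1, -1, 0]])
rows, cols = np.nonzero(ratings)  # positions of the known ratings

cv_generator = KFold(n_splits=2, shuffle=True, random_state=1234)

cv_folds = []
for train_idx, test_idx in cv_generator.split(rows):
    # One sparse {0, 1} array per side, with 1 where a rating is available.
    pair = tuple(
        coo_array(
            (np.ones(len(idx), dtype=int), (rows[idx], cols[idx])),
            shape=ratings.shape,
        )
        for idx in (train_idx, test_idx)
    )
    cv_folds.append(pair)
```

A StratifiedKFold generator can be used the same way by passing the rating values as labels, e.g. cv_generator.split(rows, ratings[rows, cols]).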

stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)

Splits the data into training and testing datasets randomly.

Parameters

dataset : stanscofi.Dataset

dataset to split

test_size : float

value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset

metric : str

metric to consider to assess distance between training and testing sets. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int

random seed

Returns

cv_folds : list of COO-arrays of shape (n_items, n_users)

list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
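
The three returned distances can be illustrated with scipy.spatial.distance.cdist. This is only a sketch: it assumes each element is represented by a feature vector (one row per element), and the helper name min_nonzero_distance and the tolerance used to discard numerically zero distances are choices made for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist


def min_nonzero_distance(A, B, metric="cosine", tol=1e-12):
    """Hypothetical helper: smallest pairwise distance above tol, or NaN if none."""
    d = cdist(A, B, metric=metric)
    nonzero = d[d > tol]  # drop (numerically) zero distances, e.g. identical points
    return float(nonzero.min()) if nonzero.size else float("nan")


# Toy feature vectors for the training and testing elements (assumption).
train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[1.0, 1.0], [1.0, 0.0]])

dist_train_test = min_nonzero_distance(train, test)  # between the two sets
dist_train = min_nonzero_distance(train, train)      # inside the training set
dist_test = min_nonzero_distance(test, test)         # inside the testing set
```

A large dist_train_test relative to dist_train and dist_test suggests that the testing elements are not near-duplicates of training ones.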

stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop=None, metric='cosine', random_state=1234, niter=100, verbose=False)

Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.

Parameters

dataset : stanscofi.Dataset

dataset to split

test_size : float

value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset

early_stop : int or None

positive integer, which stops the search for the number of clusters after 3 tries yielding the same number; note that if early_stop is not None, the property on test_size may no longer hold

metric : str

metric to consider to perform hierarchical clustering on the dataset. Should belong to [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

random_state : int

random seed

niter : int

maximum number of iterations of the clustering loop

verbose : bool

prints out information

Returns

train_folds, test_folds : COO-array of shape (n_items, n_users)

an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set

dist_train_test, dist_train, dist_test : float

minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
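
The idea of a clustering-based split can be sketched with SciPy's hierarchical clustering. This is a simplified illustration, not the stanscofi algorithm: it fixes the number of clusters instead of searching for it (so niter and early_stop have no counterpart), measures test_size as a share of items rather than of ratings, and returns boolean item masks instead of COO-arrays. The name cluster_split_sketch and the average-linkage choice are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_split_sketch(item_features, test_size, metric="cosine", n_clusters=2):
    """Hypothetical helper: assign whole item clusters to the test set."""
    # NaNs in the item feature matrix are converted to 0, as stated above.
    X = np.nan_to_num(np.asarray(item_features, dtype=float))
    Z = linkage(X, method="average", metric=metric)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    n_items = X.shape[0]
    test_mask = np.zeros(n_items, dtype=bool)
    # Add clusters, smallest first, while the test share stays within test_size.
    for label in sorted(np.unique(labels), key=lambda l: (labels == l).sum()):
        members = labels == label
        if (test_mask.sum() + members.sum()) / n_items <= test_size:
            test_mask |= members
    return ~test_mask, test_mask


# Toy item features: four items pointing one way, two pointing another way.
item_features = np.array(
    [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
)
train_mask, test_mask = cluster_split_sketch(item_features, test_size=0.4)
```

Because whole clusters go to one side, items in the test set tend to be far (under the chosen metric) from every training item, which is the point of a weakly correlated split.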