Training/Testing

stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Trains a model on a dataset with cross-validation and custom metrics, using sklearn.model_selection.StratifiedKFold.

…

Parameters

template (stanscofi.BasicModel or subclass): type of model to train
params (dict): dictionary of parameters used to initialize the model
train_dataset (stanscofi.Dataset): dataset to train on
nsplits (int): number of cross-validation steps
metric (str): metric to optimize the model on. Implemented metrics are in validation.py
k (int, default: 1): argument of the metric to optimize. Implemented metrics are in validation.py
beta (float, default: 1): argument of the metric to optimize. Implemented metrics are in validation.py
threshold (float, default: 0): decision threshold
test_size (float, default: 0.2): proportion of the data assigned to the testing set (if cv_type="weakly_correlated")
dist_type (str, default: "cosine"): type of metric for splitting (if cv_type="weakly_correlated")
cv_type (str, default: "random"): type of split to apply to the dataset. Either "random" or "weakly_correlated"
early_stop (int or None, default: 2): positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, the property on test_size does not necessarily hold anymore
njobs (int, default: 1): number of jobs to run in parallel. Should be lower than nsplits-1
random_state (int, default: 1234): random seed
show_plots (bool, default: False): shows the validation plots at each cross-validation step
verbose (bool, default: False): prints out information

Returns

results (dict): a dictionary which contains

"models" (list of length nsplits of instances of stanscofi.models.BasicModel subclasses): all trained models
"train_metric" (list of floats of length nsplits): all metrics on training sets
"test_metric" (list of floats of length nsplits): all metrics on testing sets
"cv_folds" (list of length nsplits of COO-arrays of shape (n_items, n_users)): the training and testing folds for each split
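
The shape of the returned dictionary can be illustrated without stanscofi itself. The following is a minimal sketch, assuming a toy classifier (MidpointModel), an accuracy metric, and a helper named cv_sketch, all of which are hypothetical and not part of the library; only StratifiedKFold is taken from the description above. Folds are stored here as index pairs rather than COO-arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


class MidpointModel:
    """Hypothetical toy classifier: thresholds one feature at the midpoint of the class means."""

    def fit(self, X, y):
        self.cut_ = (X[y == 0, 0].mean() + X[y == 1, 0].mean()) / 2
        return self

    def predict(self, X):
        return (X[:, 0] > self.cut_).astype(int)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def cv_sketch(X, y, nsplits, random_state=1234):
    """Train one model per stratified fold and collect per-fold metrics."""
    cv = StratifiedKFold(n_splits=nsplits, shuffle=True, random_state=random_state)
    results = {"models": [], "train_metric": [], "test_metric": [], "cv_folds": []}
    for train_idx, test_idx in cv.split(X, y):
        model = MidpointModel().fit(X[train_idx], y[train_idx])
        results["models"].append(model)
        results["train_metric"].append(accuracy(y[train_idx], model.predict(X[train_idx])))
        results["test_metric"].append(accuracy(y[test_idx], model.predict(X[test_idx])))
        results["cv_folds"].append((train_idx, test_idx))  # index pairs, not COO-arrays
    return results


# Toy data: two classes separated along a single feature.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
results = cv_sketch(X, y, nsplits=5)
```

Each list in results has one entry per split, which mirrors the "of length nsplits" wording in the return description.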

stanscofi.training_testing.grid_search(search_params, template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)

Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.

…

Parameters

search_params (dict): a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid-search
template (stanscofi.BasicModel or subclass): type of model to train
params (dict): dictionary of parameters used to initialize the model
train_dataset (stanscofi.Dataset): dataset to train on
nsplits (int): number of cross-validation steps
metric (str): metric to optimize the model on. Implemented metrics are in validation.py
k (int, default: 1): argument of the metric to optimize. Implemented metrics are in validation.py
beta (float, default: 1): argument of the metric to optimize. Implemented metrics are in validation.py
threshold (float, default: 0): decision threshold
test_size (float, default: 0.2): proportion of the data assigned to the testing set (if cv_type="weakly_correlated")
dist_type (str, default: "cosine"): type of metric for splitting (if cv_type="weakly_correlated")
cv_type (str, default: "random"): type of split to apply to the dataset. Either "random" or "weakly_correlated"
early_stop (int or None, default: 2): positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, the property on test_size does not necessarily hold anymore
njobs (int, default: 1): number of jobs to run in parallel. Should be lower than nsplits-1
random_state (int, default: 1234): random seed
show_plots (bool, default: False): shows the validation plots at each cross-validation step
verbose (bool, default: False): prints out information

Returns

best_params (dict): a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps

best_model (instance of a stanscofi.models.BasicModel subclass): the best trained model associated with the best parameters

metrics (dict): a dictionary which contains

"train_metric" (float): the metric on the training set on the best cross-validation split for the best set of parameters
"test_metric" (float): the metric on the testing set on the best cross-validation split for the best set of parameters
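
Optimizing "one parameter at a time" differs from an exhaustive search over all combinations. The following is a minimal sketch of that strategy, assuming a hypothetical helper one_at_a_time_search and a toy score function standing in for the cross-validated test metric; it is not the library's implementation, and whether stanscofi seeds the search from params in this way is an assumption.

```python
def one_at_a_time_search(search_params, params, score):
    """Sweep each hyperparameter in turn, keeping the others at their current best."""
    best_params = dict(params)
    best_score = score(best_params)
    for name, candidates in search_params.items():
        for value in candidates:
            trial = {**best_params, name: value}
            trial_score = score(trial)
            if trial_score > best_score:
                best_params, best_score = trial, trial_score
    return best_params, best_score


def toy_score(p):
    # Hypothetical stand-in for a cross-validated metric; maximal at a=3, b=1.
    return -(p["a"] - 3) ** 2 - (p["b"] - 1) ** 2


search_params = {"a": [1, 2, 3, 4], "b": [0, 1, 2]}
best_params, best_score = one_at_a_time_search(search_params, {"a": 1, "b": 0}, toy_score)
print(best_params, best_score)  # {'a': 3, 'b': 1} 0
```

In this sketch the search evaluates 4 + 3 candidate values (plus the starting point) instead of the 12 combinations of a full grid, at the cost of possibly missing optima that depend on interactions between parameters.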

stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)

Converts indices of datapoints into folds as defined in stanscofi

…

Parameters

indices (array-like of size (n_selected_ratings, )): flat indices of selected datapoints
indices_array (array-like of size (n_total_ratings, 2)): corresponding row and column indices of datapoints
shape (tuple of integers of size 2): total numbers of rows and columns

Returns

folds (COO-array of shape shape): folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
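
One plausible reading of this conversion is sketched below: each flat index selects a (row, column) pair from indices_array, and the selected positions are marked with 1 in a sparse array of the given shape. The function name indices_to_folds_sketch is hypothetical, and scipy.sparse.coo_matrix is used here as a stand-in for the COO-array type.

```python
import numpy as np
from scipy.sparse import coo_matrix


def indices_to_folds_sketch(indices, indices_array, shape):
    """Mark the selected (row, column) positions with 1 in a sparse COO matrix."""
    indices_array = np.asarray(indices_array)
    selected = indices_array[np.asarray(indices, dtype=int)]
    rows, cols = selected[:, 0], selected[:, 1]
    data = np.ones(len(selected), dtype=int)
    return coo_matrix((data, (rows, cols)), shape=shape)


# Four known datapoints in a 3 x 3 matrix; select the 2nd and 4th.
indices_array = np.array([[0, 0], [0, 2], [1, 1], [2, 0]])
folds = indices_to_folds_sketch([1, 3], indices_array, shape=(3, 3))
print(folds.toarray())
# [[0 0 1]
#  [0 0 0]
#  [1 0 0]]
```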

stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')

Splits the data into training and testing datasets randomly for cross-validation.

…

Parameters

dataset (stanscofi.Dataset): dataset to split
cv_generator (scikit-learn cross-validation index generator): e.g., StratifiedKFold, KFold
metric (str, default: "cosine"): metric used to assess the distance between training and testing sets. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']

Returns

cv_folds (list of size nsplits of COO-arrays of shape (n_items, n_users)): list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_lst (list of size nsplits of tuples of 3 floats): for each fold, minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
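
How a scikit-learn index generator can be turned into {0, 1} folds is sketched below. This is an illustration under stated assumptions, not the library's code: the toy ratings matrix, the to_fold helper, and the choice to split at the level of individual known ratings (stratified by rating value) are all hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.model_selection import StratifiedKFold

# Toy ratings of shape (n_items, n_users): 1 = positive, -1 = negative, 0 = unknown.
ratings = np.array([
    [1, -1, 0],
    [0, 1, -1],
    [1, 0, -1],
    [-1, 1, 1],
])
rows, cols = np.nonzero(ratings)  # positions of the known ratings
labels = ratings[rows, cols]      # rating values, used for stratification


def to_fold(idx):
    """Build a {0, 1} COO matrix marking the ratings selected by idx."""
    data = np.ones(len(idx), dtype=int)
    return coo_matrix((data, (rows[idx], cols[idx])), shape=ratings.shape)


cv_generator = StratifiedKFold(n_splits=2, shuffle=True, random_state=1234)
cv_folds = [
    (to_fold(train_idx), to_fold(test_idx))
    for train_idx, test_idx in cv_generator.split(labels.reshape(-1, 1), labels)
]
```

In each pair, the training and testing folds are disjoint and together cover every known rating.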

stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)

Splits the data into training and testing datasets randomly.

…

Parameters

dataset (stanscofi.Dataset): dataset to split
test_size (float): value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset
metric (str, default: "cosine"): metric used to assess the distance between training and testing sets. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
random_state (int, default: 1234): random seed

Returns

cv_folds (list of COO-arrays of shape (n_items, n_users)): list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_train_test, dist_train, dist_test (float): minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
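
The three returned distances can be illustrated with scipy.spatial.distance.cdist. The sketch below assumes that an "element" is a row of a feature matrix, which this section does not specify; the helper min_nonzero_distance and the toy vectors are hypothetical. The Euclidean metric is used so that the values are easy to check by hand.

```python
import numpy as np
from scipy.spatial.distance import cdist


def min_nonzero_distance(A, B, metric="cosine"):
    """Smallest strictly positive pairwise distance between rows of A and rows of B."""
    d = cdist(A, B, metric=metric)
    nonzero = d[d > 0]
    return float(nonzero.min()) if nonzero.size else float("nan")


train = np.array([[0.0, 0.0], [3.0, 4.0]])
test = np.array([[0.0, 0.0], [6.0, 8.0]])

dist_train_test = min_nonzero_distance(train, test, metric="euclidean")
dist_train = min_nonzero_distance(train, train, metric="euclidean")
dist_test = min_nonzero_distance(test, test, metric="euclidean")
print(dist_train_test, dist_train, dist_test)  # 5.0 5.0 10.0
```

Zero distances are excluded, so the shared point [0, 0] and the diagonal of the within-set distance matrices do not affect the result.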

stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop=None, metric='cosine', random_state=1234, niter=100, verbose=False)

Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering on the item feature matrix. NaNs in the item feature matrix are converted to 0.

…

Parameters

dataset (stanscofi.Dataset): dataset to split
test_size (float): value strictly between 0 and 1 which indicates the maximum proportion of the initial data (positive and negative ratings) assigned to the test dataset
early_stop (int or None, default: None): positive integer, which stops the cluster number search after 3 tries yielding the same number; note that if early_stop is not None, the property on test_size does not necessarily hold anymore
metric (str, default: "cosine"): metric used to perform hierarchical clustering on the dataset. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
random_state (int, default: 1234): random seed
niter (int, default: 100): maximum number of iterations of the clustering loop
verbose (bool, default: False): prints out information

Returns

train_folds, test_folds (COO-array of shape (n_items, n_users)): an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
dist_train_test, dist_train, dist_test (float): minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
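
The general idea of a clustering-based split can be sketched with scipy.cluster.hierarchy. The code below is a simplified illustration, not the library's algorithm: the helper cluster_split_sketch is hypothetical, the number of clusters is fixed rather than searched for (so niter and early_stop have no counterpart), average linkage is an arbitrary choice, and test_size is applied to the number of items rather than to the number of ratings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_split_sketch(item_features, test_size, n_clusters, metric="cosine"):
    """Assign whole clusters of items to the test set, up to test_size of the items."""
    X = np.nan_to_num(np.asarray(item_features, dtype=float))  # NaNs become 0
    Z = linkage(X, method="average", metric=metric)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    n_items = X.shape[0]
    sizes = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    test_mask = np.zeros(n_items, dtype=bool)
    # Smallest clusters first, skipping any cluster that would exceed the budget.
    for c in sorted(sizes, key=sizes.get):
        if test_mask.sum() + sizes[c] <= test_size * n_items:
            test_mask |= labels == c
    return ~test_mask, test_mask


# Eight items pointing roughly along the first axis, two along the second.
item_features = np.array([
    [10, 0], [10, 1], [9, 0], [10, np.nan], [11, 1], [9, 1], [10, 2], [11, 0],
    [0, 10], [1, 10],
])
train_mask, test_mask = cluster_split_sketch(item_features, test_size=0.25, n_clusters=2)
print(np.flatnonzero(test_mask))  # [8 9]
```

Because items in the same cluster always land on the same side of the split, similar items do not appear in both the training and testing sets.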