Training/Testing
- stanscofi.training_testing.cv_training(template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)
Trains a model on a dataset with cross-validation (via sklearn.model_selection.StratifiedKFold) and custom metrics
…
Parameters
- template : stanscofi.BasicModel or subclass
type of model to train
- params : dict
dictionary of parameters to initialize the model
- train_dataset : stanscofi.Dataset
dataset to train upon
- nsplits : int
number of cross-validation steps
- metric : str
metric to optimize the model upon. Implemented metrics are in validation.py
- k : int (default: 1)
argument of the metric to optimize. Implemented metrics are in validation.py
- beta : float (default: 1)
argument of the metric to optimize. Implemented metrics are in validation.py
- threshold : float (default: 0)
decision threshold
- test_size : float (default: 0.2)
percentage of the dataset assigned to the testing set (if cv_type="weakly_correlated")
- dist_type : str (default: "cosine")
type of metric for splitting (if cv_type="weakly_correlated")
- cv_type : str (default: "random")
type of split to apply to the dataset. Can either be "random" or "weakly_correlated"
- early_stop : int or None (default: 2)
positive integer which stops the cluster-number search after 3 tries yielding the same number of clusters; note that if early_stop is not None, then the property on test_size does not necessarily hold anymore
- njobs : int (default: 1)
number of jobs to run in parallel. Should be lower than nsplits-1
- random_state : int (default: 1234)
random seed
- show_plots : bool (default: False)
shows the validation plots at each cross-validation step
- verbose : bool (default: False)
prints out information
Returns
- results : dict
a dictionary which contains:
- "models" : list of subinstances of stanscofi.models.BasicModel of length nsplits
all trained models
- "train_metric" : list of floats of length nsplits
all metrics on training sets
- "test_metric" : list of floats of length nsplits
all metrics on testing sets
- "cv_folds" : list of COO-arrays of shape (n_items, n_users) of length nsplits
the training and testing folds for each split
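A minimal usage sketch (hypothetical names throughout): MyModel stands for any subclass of stanscofi.models.BasicModel, train_dataset for an existing stanscofi.Dataset, and "AUC" is only assumed to be among the metrics implemented in validation.py.

    from stanscofi.training_testing import cv_training

    results = cv_training(
        MyModel, {"some_param": 1.0},  # hypothetical model class and init parameters
        train_dataset,                 # placeholder stanscofi.Dataset instance
        nsplits=5,
        metric="AUC",                  # assumed to be implemented in validation.py
        cv_type="random",
        njobs=1,
        random_state=1234,
    )
    # pick the model with the best test metric across the 5 splits
    best_split = max(range(5), key=lambda i: results["test_metric"][i])
    best_model = results["models"][best_split]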
- stanscofi.training_testing.grid_search(search_params, template, params, train_dataset, nsplits, metric, k=1, beta=1, threshold=0, test_size=0.2, dist_type='cosine', cv_type='random', early_stop=2, njobs=1, random_state=1234, show_plots=False, verbose=False)
Grid-search over hyperparameters, iteratively optimizing over one parameter at a time, and internally calling cv_training.
…
Parameters
- search_params : dict
a dictionary which contains as keys the hyperparameter names and as values the corresponding intervals to explore during the grid search
- template : stanscofi.BasicModel or subclass
type of model to train
- params : dict
dictionary of parameters to initialize the model
- train_dataset : stanscofi.Dataset
dataset to train upon
- nsplits : int
number of cross-validation steps
- metric : str
metric to optimize the model upon. Implemented metrics are in validation.py
- k : int (default: 1)
argument of the metric to optimize. Implemented metrics are in validation.py
- beta : float (default: 1)
argument of the metric to optimize. Implemented metrics are in validation.py
- threshold : float (default: 0)
decision threshold
- test_size : float (default: 0.2)
percentage of the dataset assigned to the testing set (if cv_type="weakly_correlated")
- dist_type : str (default: "cosine")
type of metric for splitting (if cv_type="weakly_correlated")
- cv_type : str (default: "random")
type of split to apply to the dataset. Can either be "random" or "weakly_correlated"
- early_stop : int or None (default: 2)
positive integer which stops the cluster-number search after 3 tries yielding the same number of clusters; note that if early_stop is not None, then the property on test_size does not necessarily hold anymore
- njobs : int (default: 1)
number of jobs to run in parallel. Should be lower than nsplits-1
- random_state : int (default: 1234)
random seed
- show_plots : bool (default: False)
shows the validation plots at each cross-validation step
- verbose : bool (default: False)
prints out information
Returns
- best_params : dict
a dictionary which contains as keys the hyperparameter names and as values the best values obtained across all grid-search steps
- best_model : subinstance of stanscofi.models.BasicModel
the best trained model associated with the best parameters
- metrics : dict
a dictionary which contains:
- "train_metric" : float
the metric on the training set of the best cross-validation split for the best set of parameters
- "test_metric" : float
the metric on the testing set of the best cross-validation split for the best set of parameters
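A hedged sketch of a grid search, reusing the same hypothetical MyModel and train_dataset placeholders as above; the hyperparameter name "n_components" is illustrative only, and the three documented return values are assumed to come back as a tuple.

    from stanscofi.training_testing import grid_search

    search_params = {"n_components": [2, 5, 10]}   # hypothetical hyperparameter grid
    best_params, best_model, metrics = grid_search(
        search_params,
        MyModel, {"n_components": 2},              # hypothetical model class and defaults
        train_dataset,                             # placeholder stanscofi.Dataset instance
        nsplits=5,
        metric="AUC",                              # assumed to be implemented in validation.py
    )
    print(best_params, metrics["train_metric"], metrics["test_metric"])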
- stanscofi.training_testing.indices_to_folds(indices, indices_array, shape)
Converts indices of datapoints into folds as defined in stanscofi
…
Parameters
- indices : array-like of size (n_selected_ratings, )
flat indices of selected datapoints
- indices_array : array-like of size (n_total_ratings, 2)
corresponding row and column indices of datapoints
- shape : tuple of integers of size 2
total numbers of rows and columns
Returns
- folds : COO-array of shape shape
folds which can be fed to other functions in stanscofi, e.g., dataset.subset(folds)
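A small illustrative sketch: the matrix dimensions, rating positions and selected indices below are made up for the example, and dataset.subset is only mentioned as in the description above.

    import numpy as np
    from stanscofi.training_testing import indices_to_folds

    # toy setting: 3 items x 2 users, with 4 known ratings at these (row, column) positions
    indices_array = np.array([[0, 0], [1, 1], [2, 0], [2, 1]])
    selected = np.array([0, 2])     # flat indices of the datapoints to keep
    folds = indices_to_folds(selected, indices_array, (3, 2))
    # folds is a COO-array mask that can be fed to, e.g., dataset.subset(folds)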
- stanscofi.training_testing.random_cv_split(dataset, cv_generator, metric='cosine')
Splits the data into training and testing datasets randomly for cross-validation.
…
Parameters
- dataset : stanscofi.Dataset
dataset to split
- cv_generator : scikit-learn cross-validation index generator
e.g., StratifiedKFold, KFold
- metric : str
metric used to assess the distance between the training and testing sets. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
Returns
- cv_folds : list (one entry per split generated by cv_generator) of COO-arrays of shape (n_items, n_users)
list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_lst : list (one entry per split) of tuples of 3 floats
for each fold, minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
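A hedged call sketch: dataset is a placeholder for an existing stanscofi.Dataset instance, and StratifiedKFold is used as the cv_generator as suggested above.

    from sklearn.model_selection import StratifiedKFold
    from stanscofi.training_testing import random_cv_split

    cv_generator = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
    cv_folds, dist_lst = random_cv_split(dataset, cv_generator, metric="cosine")
    # each entry of cv_folds holds the training/testing folds of one split, and the matching
    # entry of dist_lst holds the (train-test, within-train, within-test) minimum distances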
- stanscofi.training_testing.random_simple_split(dataset, test_size, metric='cosine', random_state=1234)
Splits the data into training and testing datasets randomly.
…
Parameters
- dataset : stanscofi.Dataset
dataset to split
- test_size : float
value strictly between 0 and 1 which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset
- metric : str
metric used to assess the distance between the training and testing sets. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
- random_state : int
random seed
Returns
- cv_folds : list of COO-arrays of shape (n_items, n_users)
list of arrays which contain values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_train_test, dist_train, dist_test : float
minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
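A hedged call sketch, again with dataset as a placeholder for an existing stanscofi.Dataset instance; the returned values are left unpacked here, since their packing is exactly what the Returns section above documents.

    from stanscofi.training_testing import random_simple_split

    out = random_simple_split(dataset, test_size=0.2, metric="cosine", random_state=1234)
    # out packs cv_folds (the training/testing COO-array masks) together with the three
    # minimum nonzero distances (train-test, within-train, within-test) documented above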
- stanscofi.training_testing.weakly_correlated_split(dataset, test_size, early_stop, metric, random_state, niter, verbose)
Splits the data into training and testing datasets with a low correlation among items, by applying a hierarchical clustering to the item feature matrix. NaNs in the item feature matrix are converted to 0.
…
Parameters
- dataset : stanscofi.Dataset
dataset to split
- test_size : float
value strictly between 0 and 1 which indicates the maximum percentage of initial data (positive and negative ratings) being assigned to the test dataset
- early_stop : int or None
positive integer which stops the cluster-number search after 3 tries yielding the same number of clusters; note that if early_stop is not None, then the property on test_size does not necessarily hold anymore
- metric : str
metric used to perform the hierarchical clustering on the dataset. Should belong to ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
- random_state : int
random seed
- niter : int
maximum number of iterations of the clustering loop
- verbose : bool
prints out information
Returns
- train_folds, test_folds : COO-array of shape (n_items, n_users)
an array which contains values in {0, 1} describing the unavailable and available user-item matchings in the training (resp. testing) set
- dist_train_test, dist_train, dist_test : float
minimum nonzero distance between an element in the training and in the testing sets, resp. inside the training set, resp. inside the testing set
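A hedged call sketch, assuming the splitter documented here is exposed as stanscofi.training_testing.weakly_correlated_split (name inferred from the cv_type option above) and that all listed parameters can be passed as keywords; dataset is a placeholder stanscofi.Dataset instance and the argument values are illustrative only.

    from stanscofi.training_testing import weakly_correlated_split  # name inferred, not confirmed

    out = weakly_correlated_split(dataset, test_size=0.2, early_stop=None, metric="cosine",
                                  random_state=1234, niter=100, verbose=False)
    # out packs the train/test COO-array folds and the three minimum nonzero distances
    # documented above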