Preprocessing

class stanscofi.preprocessing.CustomScaler(posinf, neginf)

Bases: object

A class that encodes a simple preprocessing pipeline for feature matrices: mean imputation of missing feature values, feature filtering, replacement of infinite values, and standardization.

…

Parameters

posinf int: Value that replaces positive infinity values
neginf int: Value that replaces negative infinity values

Attributes

imputer None or sklearn.impute.SimpleImputer instance: Class for imputation of values
scaler None or sklearn.preprocessing.StandardScaler: Class for standardization of values
filter None or list: List of selected features (Top-N in terms of variance)

Methods

__init__(params): Initialize the scaler (with unfitted attributes)
fit_transform(mat, subset=None, verbose=False): Fits classes and transforms a matrix

fit_transform(mat, subset=None, verbose=False)

Fits each attribute of the scaler and transforms a feature matrix. The pipeline applies mean imputation of missing feature values, feature filtering, replacement of infinite values, and standardization.

…

Parameters

mat array-like of shape (n_samples, n_features): matrix which should be preprocessed
subset None or int: number of features to keep in the feature matrix (Top-N in variance); if subset is None, the filter attribute is used to filter features when it is already set, and is initialized otherwise
verbose bool: prints out information

Returns

mat_nan array-like of shape (n_samples, n_features): Preprocessed matrix
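
The four steps named above can be sketched in plain Python. This is a minimal illustration, not the stanscofi implementation (which, per the attribute list, relies on sklearn.impute.SimpleImputer and sklearn.preprocessing.StandardScaler). The function name preprocess, the order of the steps, the default placeholder values and the use of population variance are all assumptions made for the example.

```python
import math
import statistics


def preprocess(mat, posinf=10.0, neginf=-10.0, subset=None):
    """Hypothetical sketch of the pipeline on a list of rows."""
    n_features = len(mat[0])
    # 1. Replace infinite values by finite placeholders (NaN is left as-is).
    rows = [[posinf if v == math.inf else neginf if v == -math.inf else v
             for v in row] for row in mat]
    # 2. Mean imputation: NaN becomes the mean of the observed values.
    #    Assumes every feature has at least one observed value.
    for j in range(n_features):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        mean_j = statistics.fmean(observed)
        for row in rows:
            if math.isnan(row[j]):
                row[j] = mean_j
    # 3. Feature filtering: keep the `subset` features with highest variance.
    keep = list(range(n_features))
    if subset is not None:
        variances = [statistics.pvariance([row[j] for row in rows])
                     for j in range(n_features)]
        top = sorted(keep, key=lambda j: -variances[j])[:subset]
        keep = sorted(top)
        rows = [[row[j] for j in keep] for row in rows]
    # 4. Standardization: zero mean and unit variance per remaining feature.
    for j in range(len(keep)):
        col = [row[j] for row in rows]
        mu, sigma = statistics.fmean(col), statistics.pstdev(col)
        for row in rows:
            row[j] = (row[j] - mu) / sigma if sigma > 0 else 0.0
    return rows, keep


nan, inf = math.nan, math.inf
mat = [[1.0, nan, 5.0],
       [3.0, 2.0, inf],
       [nan, 2.5, 5.0]]
rows, keep = preprocess(mat, subset=2)
```

Here keep holds the indices of the retained columns, which plays the role that the filter attribute has in the description above.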

stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)

Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Concatenating item and user features for a given pair would result in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices). Instead, this procedure computes, for each (item, user) pair, a single score per pair of (item_matrix, user_matrix), resulting in a vector of size (n_item_matrices) x (n_user_matrices).

The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i’) x fu(u’, u)) | (i’,u’)!=(i,u), fi(i, i’)!=NaN, fu(u’, u)!=NaN, rating(i’,u’)!=0 }

The final feature matrix is then X = (X_{p,q})_{p,q}, where p indexes the (item, user) pairs and q indexes the (item similarity, user similarity) pairs.
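
As a worked illustration of the score formula, the sketch below evaluates it for a single pair of similarity matrices on a toy example. The function name perlman_score, the nested-list data layout, and the choice to return None when no neighbouring pair qualifies are assumptions made for the example, not the library's behaviour. It also assumes non-negative similarity values so that the square root is defined.

```python
import math


def perlman_score(i, u, fi, fu, ratings):
    """Hypothetical helper: score_{fi,fu}(i, u) for one (fi, fu) pair.

    fi: item-item similarity (n_items x n_items), NaN = missing
    fu: user-user similarity (n_users x n_users), NaN = missing
    ratings: n_items x n_users matrix with values in {-1, 0, 1}
    """
    best = None
    for i2 in range(len(ratings)):
        for u2 in range(len(ratings[0])):
            # Skip the pair itself and unlabelled pairs (rating == 0).
            if (i2, u2) == (i, u) or ratings[i2][u2] == 0:
                continue
            s_item, s_user = fi[i][i2], fu[u2][u]
            # Skip pairs for which either similarity is missing.
            if math.isnan(s_item) or math.isnan(s_user):
                continue
            value = math.sqrt(s_item * s_user)
            best = value if best is None else max(best, value)
    return best  # None if no labelled neighbour qualifies (assumption)


nan = math.nan
fi = [[1.0, 0.8, 0.1],
      [0.8, 1.0, nan],
      [0.1, nan, 1.0]]
fu = [[1.0, 0.5],
      [0.5, 1.0]]
ratings = [[0, 1],
           [1, 0],
           [0, -1]]

score_00 = perlman_score(0, 0, fi, fu, ratings)
score_10 = perlman_score(1, 0, fi, fu, ratings)
```

For the pair (0, 0), the candidates are (0, 1), (1, 0) and (2, 1), giving sqrt(0.5), sqrt(0.8) and sqrt(0.05), so the score is sqrt(0.8). For the pair (1, 0), the candidate (2, 1) is skipped because fi(1, 2) is NaN, leaving sqrt(0.4).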

…

Parameters

dataset stanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) with the following attributes

ratings COO-array of shape (n_items, n_users): an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
items COO-array of shape (n_item_features, n_items): concatenation of n_drug_features drug similarity matrices of shape (n_drugs, n_drugs), where values in item_features are denoted by “<feature><sep_feature><drug>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that items is a single similarity matrix (n_item_matrices=1)
users COO-array of shape (n_user_features, n_users): concatenation of n_disease_features disease similarity matrices of shape (n_diseases, n_diseases), where values in user_features are denoted by “<feature><sep_feature><disease>” and missing values are denoted by numpy.nan; if the prefix in “<feature><sep_feature>” is missing, it is assumed that users is a single similarity matrix (n_user_matrices=1)

NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).

njobs int

number of jobs to run in parallel

sep_feature str or None

separator between feature type and element in the feature matrices in dataset; None if a single feature type is expected

missing int

placeholder value that should be different from any feature name

inf int

value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)

verbose bool

prints out information

Returns

X array-like of shape (n_items x n_users, n_item_features x n_user_features): the feature matrix
y array-like of shape (n_items x n_users, ): the response/outcome vector

stanscofi.preprocessing.cartesian_product_transpose(*arrays)

stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)

Computes a single feature matrix and response vector from a drug repurposing dataset: missing values are imputed by the average value of the feature, the user and item feature matrices are centered and standardized, and the two are then concatenated.

…

Parameters

dataset stanscofi.Dataset: dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) where missing values are denoted by numpy.nan
subset None or int: number of features to keep in item feature matrix, and in user feature matrix (selecting the ones with highest variance)
scalerS None or sklearn.preprocessing.StandardScaler instance: scaler for items
scalerP None or sklearn.preprocessing.StandardScaler instance: scaler for users
verbose bool: prints out information

Returns

X array-like of shape (n_folds, n_item_features+n_user_features): the feature matrix
y array-like of shape (n_folds, ): the response/outcome vector
scalerS None or stanscofi.preprocessing.CustomScaler instance: scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
scalerP None or stanscofi.preprocessing.CustomScaler instance: scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
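
The concatenation step can be pictured with a short plain-Python sketch. It assumes the item and user feature vectors have already been imputed and standardized, and it builds one row per (item, user) pair. The function name concatenate_pairs, the nested-list layout, and the choice to enumerate every pair (rather than only the pairs kept by the library) are assumptions made for the example.

```python
from itertools import product


def concatenate_pairs(item_vectors, user_vectors, ratings):
    """Hypothetical helper: one row [item features | user features] per pair."""
    X, y = [], []
    for i, u in product(range(len(item_vectors)), range(len(user_vectors))):
        # Row of length n_item_features + n_user_features.
        X.append(item_vectors[i] + user_vectors[u])
        # Matching outcome in {-1, 0, 1}.
        y.append(ratings[i][u])
    return X, y


item_vectors = [[0.5, -1.0], [1.5, 0.0]]   # 2 items, 2 features each
user_vectors = [[2.0], [-2.0], [0.0]]      # 3 users, 1 feature each
ratings = [[1, 0, -1],
           [0, 1, 0]]                      # shape (n_items, n_users)

X, y = concatenate_pairs(item_vectors, user_vectors, ratings)
```

Each row of X has n_item_features+n_user_features entries, matching the second dimension of the returned feature matrix described above.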

stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)

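Preprocesses a dataset with the approach named by preprocessing_str and returns the resulting feature matrix and response vector, together with the scalers and feature filter listed under Returns.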

…

Parameters

dataset stanscofi.datasets.Dataset: dataset to preprocess
preprocessing_str str: type of preprocessing, one of [“Perlman_procedure”, “meanimputation_standardize”, “same_feature_preprocessing”]
subset_ None or int: number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance). Warning: if preprocessing_str!=“meanimputation_standardize”, the features are selected across user and item features together; otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)
operator None or str: arithmetic operation to apply, e.g. “+”, “*”
sep_feature str: separator between feature type and element in the feature matrices in dataset
filter_ None or list: list of feature indices to keep (of length subset_); overrides the subset_ parameter if both are given
scalerS None or stanscofi.preprocessing.CustomScaler instance: scaler for items; the scaler fitted on item feature vectors
scalerP None or stanscofi.preprocessing.CustomScaler instance: scaler for users; the scaler fitted on user feature vectors
inf float or int: placeholder value for infinity values (positive: +inf, negative: -inf)
njobs int: number of jobs to run in parallel (njobs > 0) for the Perlman procedure

Returns

X array-like of shape (n_folds, n_features): the feature matrix
y array-like of shape (n_folds, ): the response/outcome vector
scalerS None or stanscofi.preprocessing.CustomScaler instance: scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
scalerP None or stanscofi.preprocessing.CustomScaler instance: scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
filter_ None or list: list of feature indices to keep (of length subset_)

stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)

If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors coefficient per coefficient.

…

Parameters

dataset stanscofi.Dataset: dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True
operator str: arithmetic operation to apply, e.g. “+”, “*”

Returns

X array-like of shape (n_folds, n_features): the feature matrix
y array-like of shape (n_folds, ): the response/outcome vector
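
The coefficient-per-coefficient combination described above can be illustrated for a single (item, user) pair as follows. The function name combine and the OPERATORS table are hypothetical and only cover the two operators quoted in the parameter description. Which operator strings the library actually accepts is not stated here.

```python
import operator as op

# Assumed mapping from operator strings to element-wise functions.
OPERATORS = {"+": op.add, "*": op.mul}


def combine(item_vec, user_vec, operator="*"):
    """Hypothetical helper: element-wise combination of two feature vectors."""
    if len(item_vec) != len(user_vec):
        raise ValueError("item and user vectors must share the same features")
    f = OPERATORS[operator]
    # Apply the operator coefficient per coefficient.
    return [f(a, b) for a, b in zip(item_vec, user_vec)]


item_vec = [1.0, 2.0, 3.0]
user_vec = [4.0, 5.0, 6.0]
prod_vec = combine(item_vec, user_vec, "*")   # [4.0, 10.0, 18.0]
sum_vec = combine(item_vec, user_vec, "+")    # [5.0, 7.0, 9.0]
```

The result has as many entries as there are shared features, which is why the approach requires n_item_features==n_user_features.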