Preprocessing

class stanscofi.preprocessing.CustomScaler(posinf, neginf)

Bases: object

A class that encodes a simple preprocessing pipeline for feature matrices: mean imputation of missing feature values, feature filtering, correction of infinite values, and standardization.

Parameters

posinf : int

Value to replace infinity (positive) values

neginf : int

Value to replace infinity (negative) values

Attributes

imputer : None or sklearn.impute.SimpleImputer instance

Class for imputation of values

scaler : None or sklearn.preprocessing.StandardScaler

Class for standardization of values

filter : None or list

List of selected features (Top-N in terms of variance)

Methods

__init__(params)

Initialize the scaler (with unfitted attributes)

fit_transform(mat, subset=None, verbose=False)

Fits classes and transforms a matrix

fit_transform(mat, subset=None, verbose=False)

Fits each attribute of the scaler and transforms a feature matrix. Performs mean imputation of missing feature values, feature filtering, correction of infinite values, and standardization.

Parameters

mat : array-like of shape (n_samples, n_features)

matrix which should be preprocessed

subset : None or int

number of features to keep in the feature matrix (Top-N in variance); if subset is None, the attribute filter is initialized when it is still None, and otherwise reused to filter features

verbose : bool

prints out information

Returns

mat_nan : array-like of shape (n_samples, n_features)

Preprocessed matrix
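
The four steps named above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function name preprocess, the order of the steps, and the returned list of kept column indices are all hypothetical, and the real class relies on sklearn.impute.SimpleImputer and sklearn.preprocessing.StandardScaler.

```python
import math
from statistics import fmean, pstdev, pvariance

def preprocess(mat, posinf, neginf, subset=None):
    """Hypothetical stand-in for the described pipeline (step order is an assumption)."""
    # 1. Correct infinities: +inf -> posinf, -inf -> neginf.
    rows = [[posinf if v == math.inf else neginf if v == -math.inf else v
             for v in row] for row in mat]
    cols = [list(c) for c in zip(*rows)]
    # 2. Mean imputation: replace NaN by the mean of the observed values of the feature.
    for col in cols:
        observed = [v for v in col if not math.isnan(v)]
        mean = fmean(observed) if observed else 0.0
        col[:] = [mean if math.isnan(v) else v for v in col]
    # 3. Feature filtering: keep the Top-N features in terms of variance.
    keep = list(range(len(cols)))
    if subset is not None:
        ranked = sorted(keep, key=lambda j: pvariance(cols[j]), reverse=True)
        keep = sorted(ranked[:subset])
    # 4. Standardization: zero mean and unit (population) standard deviation.
    scaled = []
    for j in keep:
        mu, sd = fmean(cols[j]), pstdev(cols[j])
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in cols[j]])
    return [list(r) for r in zip(*scaled)], keep
```

Calling preprocess(mat, posinf=6, neginf=-6, subset=2) on a matrix with three features returns the two highest-variance columns, standardized, together with their original indices.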

stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)

Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Instead of concatenating item and user features for a given pair, resulting in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices), compute a single score per pair of (item_matrix, user_matrix) for each (item, user) pair, resulting in a vector of size (n_item_matrices) x (n_user_matrices).

The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i') x fu(u', u)) | (i',u')!=(i,u), fi(i, i')!=NaN, fu(u', u)!=NaN, rating(i',u')!=0 }

Then the final feature matrix is X = (X_{i,j})_{i,j} for i a (item, user) pair and j a (item similarity, user similarity) pair
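
As an illustration of the score defined above, the following plain-Python sketch computes it for one (item, user) pair and one pair of similarity matrices. The function name perlman_score is hypothetical, similarities are assumed to be non-negative, and returning 0 when no admissible pair exists is an assumption rather than documented behaviour.

```python
import math

def perlman_score(i, u, item_sim, user_sim, ratings):
    """Score for pair (i, u) given one item similarity and one user similarity.

    item_sim: n_items x n_items, user_sim: n_users x n_users,
    ratings: n_items x n_users with values in {-1, 0, 1}.
    """
    best = 0.0  # assumption: 0 when no other labelled pair is usable
    for i2, row in enumerate(ratings):
        for u2, r in enumerate(row):
            # Skip the pair itself and unlabelled pairs (rating == 0).
            if (i2, u2) == (i, u) or r == 0:
                continue
            si, su = item_sim[i][i2], user_sim[u2][u]
            # Skip pairs for which either similarity is missing.
            if math.isnan(si) or math.isnan(su):
                continue
            # Geometric mean of the two similarities; keep the maximum.
            best = max(best, math.sqrt(si * su))
    return best
```

Repeating this computation for every (item similarity, user similarity) combination yields one row of the feature matrix per (item, user) pair.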

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features), with the following attributes:

ratings : COO-array of shape (n_items, n_users)

an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings

items : COO-array of shape (n_item_features, n_items)

concatenation of n_item_matrices item (drug) similarity matrices of shape (n_items, n_items), where the feature names in item_features have the form “<feature><sep_feature><item>” and missing values are denoted by numpy.nan; if the prefix “<feature><sep_feature>” is missing, it is assumed that items is a single similarity matrix (n_item_matrices=1)

users : COO-array of shape (n_user_features, n_users)

concatenation of n_user_matrices user (disease) similarity matrices of shape (n_users, n_users), where the feature names in user_features have the form “<feature><sep_feature><user>” and missing values are denoted by numpy.nan; if the prefix “<feature><sep_feature>” is missing, it is assumed that users is a single similarity matrix (n_user_matrices=1)

NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).

njobs : int

number of jobs to run in parallel

sep_feature : str or None

separator between feature type and element in the feature matrices in dataset. None if there is one single feature type expected

missing : int

placeholder value that should be different from any feature name

inf : int

Value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)

verbose : bool

prints out information

Returns

X : array-like of shape (n_items x n_users, n_item_features x n_user_features)

the feature matrix

y : array-like of shape (n_items x n_users, )

the response/outcome vector

stanscofi.preprocessing.cartesian_product_transpose(*arrays)

stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)

Computes a single feature matrix and response vector from a drug repurposing dataset: missing values are imputed with the mean value of the corresponding feature, the item and user feature matrices are centered and standardized, and the resulting item and user feature vectors are concatenated

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) where missing values are denoted by numpy.nan

subset : None or int

number of features to keep in item feature matrix, and in user feature matrix (selecting the ones with highest variance)

scalerS : None or sklearn.preprocessing.StandardScaler instance

scaler for items

scalerP : None or sklearn.preprocessing.StandardScaler instance

scaler for users

inf : int

value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)

verbose : bool

prints out information

Returns

X : array-like of shape (n_folds, n_item_features+n_user_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items; if the input value was None, returns the scaler fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
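
The concatenation step can be pictured with the short sketch below. It assumes that the item and user feature matrices have already been imputed and standardized, and that one row is built for every (item, user) pair; both the helper name concatenate_pairs and the choice of pairs are assumptions made for illustration, not the documented behaviour.

```python
def concatenate_pairs(item_feats, user_feats, ratings):
    """Build one row per (item, user) pair by concatenating feature vectors.

    item_feats: n_item_features x n_items (already preprocessed)
    user_feats: n_user_features x n_users (already preprocessed)
    ratings:    n_items x n_users with values in {-1, 0, 1}
    """
    X, y = [], []
    for i, row in enumerate(ratings):
        item_vec = [feature[i] for feature in item_feats]      # column i of items
        for u, rating in enumerate(row):
            user_vec = [feature[u] for feature in user_feats]  # column u of users
            X.append(item_vec + user_vec)  # length n_item_features + n_user_features
            y.append(rating)
    return X, y
```

Each row of X then has n_item_features+n_user_features entries, which matches the shape given in the Returns section.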

stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)

Converts a dataset into a feature matrix and a response vector, using the preprocessing approach selected by preprocessing_str

Parameters

dataset : stanscofi.datasets.Dataset

dataset to preprocess

preprocessing_str : str

type of preprocessing: one of [“Perlman_procedure”, “meanimputation_standardize”, “same_feature_preprocessing”].

subset_ : None or int

number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance). Warning: the selection is made across user and item features if preprocessing_str!=“meanimputation_standardize”; otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)

operator : None or str

arithmetic operation to apply, e.g. “+”, “*”

sep_feature : str

separator between feature type and element in the feature matrices in dataset

filter_ : None or list

list of feature indices to keep (of length subset_); overrides the subset_ parameter if both are provided

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items, fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users, fitted on user feature vectors

inf : float or int

placeholder value for infinity values (positive : +inf, negative : -inf)

njobs : int

number of jobs to run in parallel (njobs > 0) for the Perlman procedure

Returns

X : array-like of shape (n_folds, n_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector

scalerS : None or stanscofi.preprocessing.CustomScaler instance

scaler for items; if the input value was None, returns the scaler fitted on item feature vectors

scalerP : None or stanscofi.preprocessing.CustomScaler instance

scaler for users; if the input value was None, returns the scaler fitted on user feature vectors

filter_ : None or list

list of feature indices to keep (of length subset_)
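
The interplay between subset_ and filter_ described in the parameters can be sketched as follows. The helper choose_features is hypothetical and only mirrors the documented precedence (an explicit filter_ wins over subset_, and subset_ selects the Top-N features by cross-sample variance); it is not the library's code.

```python
from statistics import pvariance

def choose_features(X, subset_=None, filter_=None):
    """Return the indices of the feature columns of X to keep."""
    n_features = len(X[0])
    if filter_ is not None:
        # A previously computed filter overrides subset_.
        return list(filter_)
    if subset_ is None:
        # Nothing to restrict: keep every feature.
        return list(range(n_features))
    # Rank features by variance across samples and keep the top subset_.
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:subset_])
```

Reusing the returned indices as filter_ on a second matrix (for instance a test set) keeps the same features as on the first one, which is presumably why filter_ is also returned.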

stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)

If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors coefficient per coefficient.

Parameters

dataset : stanscofi.Dataset

dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True

operator : str

arithmetic operation to apply, e.g. “+”, “*”

Returns

X : array-like of shape (n_folds, n_features)

the feature matrix

y : array-like of shape (n_folds, )

the response/outcome vector
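
A minimal sketch of the element-wise combination is given below. The helper name combine_same_features, the lookup table of supported operators, and the choice to build one row per (item, user) pair are assumptions made for illustration; how the library parses the operator string is not stated here.

```python
import operator as op

# Hypothetical mapping from operator strings to element-wise functions.
OPERATORS = {"+": op.add, "*": op.mul, "-": op.sub}

def combine_same_features(item_feats, user_feats, ratings, operator="*"):
    """Combine item and user feature vectors coefficient per coefficient.

    item_feats: n_features x n_items, user_feats: n_features x n_users,
    ratings: n_items x n_users with values in {-1, 0, 1}.
    """
    combine = OPERATORS[operator]
    X, y = [], []
    for i, row in enumerate(ratings):
        for u, rating in enumerate(row):
            # Feature k of the pair is operator(item feature k, user feature k).
            X.append([combine(fi[i], fu[u]) for fi, fu in zip(item_feats, user_feats)])
            y.append(rating)
    return X, y
```

Unlike concatenation, this keeps the number of columns at n_features, at the cost of requiring that items and users share exactly the same features.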