Preprocessing
- class stanscofi.preprocessing.CustomScaler(posinf, neginf)
Bases: object
A class that encodes a simple preprocessing pipeline for feature matrices: mean imputation of missing feature values, feature filtering, correction of infinite values, and standardization.
…
Parameters
- posinf : int
Value that replaces positive infinity
- neginf : int
Value that replaces negative infinity
Attributes
- imputer : None or sklearn.impute.SimpleImputer instance
Class for imputation of values
- scaler : None or sklearn.preprocessing.StandardScaler instance
Class for standardization of values
- filter : None or list
List of selected features (Top-N in terms of variance)
Methods
- __init__(posinf, neginf)
Initialize the scaler (with unfitted attributes)
- fit_transform(mat, subset=None, verbose=False)
Fits classes and transforms a matrix
- fit_transform(mat, subset=None, verbose=False)
Fits each attribute of the scaler and transforms a feature matrix. Applies mean imputation of missing feature values, feature filtering, correction of infinite values, and standardization.
…
Parameters
- mat : array-like of shape (n_samples, n_features)
matrix which should be preprocessed
- subset : None or int
number of features to keep in the feature matrix (Top-N in variance); if None, the filter attribute is used to filter features, or is initialized if it is still None
- verbose : bool
prints out information
Returns
- mat_nan : array-like of shape (n_samples, n_features)
Preprocessed matrix
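The four steps listed for fit_transform can be sketched with NumPy alone. This is a minimal illustration, not the stanscofi implementation: the helper name preprocess_features, the order of the steps, and the use of plain NumPy in place of SimpleImputer and StandardScaler are all assumptions made for the example.

```python
import numpy as np

def preprocess_features(mat, posinf, neginf, subset=None):
    """Hypothetical helper (not part of stanscofi).

    mat has shape (n_samples, n_features). Assumes every feature has at
    least one finite or infinite (non-NaN) value.
    """
    mat = np.array(mat, dtype=float)
    # 1. Correct infinities; NaN entries are kept for the imputation step.
    mat = np.nan_to_num(mat, nan=np.nan, posinf=posinf, neginf=neginf)
    # 2. Mean imputation, feature by feature (column by column).
    col_means = np.nanmean(mat, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(mat))
    mat[nan_rows, nan_cols] = col_means[nan_cols]
    # 3. Feature filtering: keep the Top-N features in terms of variance.
    if subset is None:
        keep = np.arange(mat.shape[1])
    else:
        keep = np.sort(np.argsort(mat.var(axis=0))[::-1][:subset])
    mat = mat[:, keep]
    # 4. Standardization (zero mean, unit variance per feature).
    std = mat.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (mat - mat.mean(axis=0)) / std, keep

demo = [[1.0, np.nan, 5.0],
        [3.0, 2.0, np.inf],
        [5.0, 4.0, -np.inf]]
out, keep = preprocess_features(demo, posinf=10, neginf=-10, subset=2)
```

In this sketch the returned index list plays the role that the filter attribute plays in CustomScaler: it records which features were selected so the same selection can be reused later.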
- stanscofi.preprocessing.Perlman_procedure(dataset, njobs=1, sep_feature=None, missing=-666, inf=2, verbose=False)
Method for combining (several) item and user similarity matrices (reference DOI: 10.1089/cmb.2010.0213). Instead of concatenating item and user features for a given pair, resulting in a vector of size (n_items x n_item_matrices)+(n_users x n_user_matrices), compute a single score per pair of (item_matrix, user_matrix) for each (item, user) pair, resulting in a vector of size (n_item_matrices) x (n_user_matrices).
The score for any item i, user u, item-item similarity fi and user-user similarity fu is score_{fi,fu}(i,u) = max { sqrt(fi(i, i') x fu(u', u)) | (i',u')!=(i,u), fi(i, i')!=NaN, fu(u', u)!=NaN, rating(i',u')!=0 }
Then the final feature matrix is X = (X_{p,q})_{p,q}, where p indexes an (item, user) pair and q indexes an (item similarity, user similarity) pair
…
Parameters
- dataset : stanscofi.datasets.Dataset
dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) with the following attributes
- ratings : COO-array of shape (n_items, n_users)
an array which contains values in {-1, 0, 1} describing the negative, unlabelled/unavailable, positive user-item matchings
- items : COO-array of shape (n_item_features, n_items)
concatenation of n_item_matrices item similarity matrices of shape (n_items, n_items), where feature names in item_features have the form "<feature><sep_feature><drug>" and missing values are denoted by numpy.nan; if the "<feature><sep_feature>" prefix is missing, items is assumed to be a single similarity matrix (n_item_matrices=1)
- users : COO-array of shape (n_user_features, n_users)
concatenation of n_user_matrices user similarity matrices of shape (n_users, n_users), where feature names in user_features have the form "<feature><sep_feature><disease>" and missing values are denoted by numpy.nan; if the "<feature><sep_feature>" prefix is missing, users is assumed to be a single similarity matrix (n_user_matrices=1)
NaN values are replaced by 0, whereas infinite values are replaced by inf (parameter below).
- njobs : int
number of jobs to run in parallel
- sep_feature : str or None
separator between feature type and element in the feature matrices in dataset; None if a single feature type is expected
- missing : int
placeholder value that should be different from any feature name
- inf : int
Value that replaces infinite values in the dataset (inf for +infinity, -inf for -infinity)
- verbose : bool
prints out information
Returns
- X : array-like of shape (n_items x n_users, n_item_matrices x n_user_matrices)
the feature matrix
- y : array-like of shape (n_items x n_users, )
the response/outcome vector
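The score formula can be illustrated for one (item similarity, user similarity) pair with a direct double loop. This is a sketch under assumptions, not the library's code: perlman_score is a hypothetical name, the inputs are dense NumPy arrays rather than COO-arrays, similarities are assumed non-negative, and the score is assumed to be 0 when no neighbouring pair qualifies.

```python
import math
import numpy as np

def perlman_score(ratings, item_sim, user_sim, i, u):
    """Hypothetical helper: score of pair (i, u) for ONE similarity pair.

    ratings: (n_items, n_users) with values in {-1, 0, 1}
    item_sim: (n_items, n_items); user_sim: (n_users, n_users)
    Assumes non-negative similarity values (sqrt of a product is taken).
    """
    n_items, n_users = ratings.shape
    best = 0.0  # assumption: score is 0 when no candidate pair exists
    for i2 in range(n_items):
        for u2 in range(n_users):
            # Skip the pair itself and pairs without a known rating.
            if (i2, u2) == (i, u) or ratings[i2, u2] == 0:
                continue
            fi, fu = item_sim[i, i2], user_sim[u2, u]
            # Skip missing similarity values.
            if np.isnan(fi) or np.isnan(fu):
                continue
            best = max(best, math.sqrt(fi * fu))
    return best

ratings = np.array([[1, 0], [0, -1]])
item_sim = np.array([[1.0, 0.5], [0.5, 1.0]])
user_sim = np.array([[1.0, 0.25], [0.25, 1.0]])
score_01 = perlman_score(ratings, item_sim, user_sim, 0, 1)
score_00 = perlman_score(ratings, item_sim, user_sim, 0, 0)
```

Repeating this for every (item similarity, user similarity) pair yields one row of X per (item, user) pair, with one column per similarity pair.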
- stanscofi.preprocessing.cartesian_product_transpose(*arrays)
- stanscofi.preprocessing.meanimputation_standardize(dataset, subset=None, scalerS=None, scalerP=None, inf=10, verbose=False)
Computes a single feature matrix and response vector from a drug repurposing dataset. Missing values are imputed by the average value of the corresponding feature, the user and item feature matrices are centered and standardized, and the resulting feature vectors are concatenated.
…
Parameters
- dataset : stanscofi.datasets.Dataset
dataset which should be transformed, with n_items items (with n_item_features features) and n_users users (with n_user_features features) where missing values are denoted by numpy.nan
- subset : None or int
number of features to keep in the item feature matrix and in the user feature matrix (selecting the ones with highest variance)
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users
- inf : int
value that replaces infinite values (inf for +infinity, -inf for -infinity)
- verbose : bool
prints out information
Returns
- X : array-like of shape (n_folds, n_item_features+n_user_features)
the feature matrix
- y : array-like of shape (n_folds, )
the response/outcome vector
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
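The concatenation step can be sketched as follows. This is an illustration under assumptions rather than the stanscofi code: concat_pair_features is a hypothetical name, the inputs are dense NumPy arrays that have already been imputed and standardized, and one row is built per (item, user) pair with a non-zero rating.

```python
import numpy as np

def concat_pair_features(ratings, items, users):
    """Hypothetical helper (not part of stanscofi).

    ratings: (n_items, n_users) with values in {-1, 0, 1}
    items: (n_item_features, n_items); users: (n_user_features, n_users)
    """
    # Assumption: only pairs with a known (non-zero) rating become rows.
    item_idx, user_idx = np.nonzero(ratings)
    # Row k is the item feature vector followed by the user feature vector.
    X = np.hstack([items[:, item_idx].T, users[:, user_idx].T])
    y = ratings[item_idx, user_idx]
    return X, y

ratings = np.array([[1, 0], [0, -1]])
items = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # 3 features, 2 items
users = np.array([[1.0, 2.0]])                           # 1 feature, 2 users
X, y = concat_pair_features(ratings, items, users)
```

Each row of X then has n_item_features+n_user_features columns, matching the shape given in Returns.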
- stanscofi.preprocessing.preprocessing_XY(dataset, preprocessing_str, operator='*', sep_feature='-', subset_=None, filter_=None, scalerS=None, scalerP=None, inf=2, njobs=1)
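Applies the preprocessing approach selected by preprocessing_str to a dataset and returns the resulting feature matrix and response vector, along with the fitted scalers and the feature filter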
…
Parameters
- dataset : stanscofi.datasets.Dataset
dataset to preprocess
- preprocessing_str : str
type of preprocessing: in ["Perlman_procedure", "meanimputation_standardize", "same_feature_preprocessing"].
- subset_ : None or int
Number of features to restrict the dataset to (Top-subset_ features in terms of cross-sample variance). Warning: the selection is made across user and item features together if preprocessing_str!="meanimputation_standardize"; otherwise 2*subset_ features are preserved (subset_ for item features, subset_ for user features)
- operator : None or str
arithmetic operation to apply, e.g. "+", "*"
- sep_feature : str
separator between feature type and element in the feature matrices in dataset
- filter_ : None or list
list of feature indices to keep (of length subset_); overrides the subset_ parameter if both are provided
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; the scaler fitted on user feature vectors
- inf : float or int
placeholder value for infinity values (positive: +inf, negative: -inf)
- njobs : int
number of jobs to run in parallel (njobs > 0) for the Perlman procedure
Returns
- X : array-like of shape (n_folds, n_features)
the feature matrix
- y : array-like of shape (n_folds, )
the response/outcome vector
- scalerS : None or stanscofi.preprocessing.CustomScaler instance
scaler for items; if the input value was None, returns the scaler fitted on item feature vectors
- scalerP : None or stanscofi.preprocessing.CustomScaler instance
scaler for users; if the input value was None, returns the scaler fitted on user feature vectors
- filter_ : None or list
list of feature indices to keep (of length subset_)
- stanscofi.preprocessing.same_feature_preprocessing(dataset, operator)
If the users and items have the same features in the dataset, then a simple way to combine the user and item feature matrices is to apply an element-wise arithmetic operator (*, +, etc.) to the feature vectors, coefficient by coefficient.
…
Parameters
- dataset : stanscofi.datasets.Dataset
dataset which should be transformed, where n_item_features==n_user_features and dataset.same_item_user_features==True
- operator : str
arithmetic operation to apply, e.g. "+", "*"
Returns
- X : array-like of shape (n_folds, n_features)
the feature matrix
- y : array-like of shape (n_folds, )
the response/outcome vector
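The element-wise combination can be sketched as below. This is an illustration under assumptions, not the library's implementation: combine_pair_features and the OPERATORS table are hypothetical names, the inputs are dense NumPy arrays, and one row is built per (item, user) pair with a non-zero rating.

```python
import operator
import numpy as np

# Hypothetical mapping from the operator string to a Python function.
OPERATORS = {"+": operator.add, "*": operator.mul}

def combine_pair_features(ratings, items, users, op="*"):
    """Hypothetical helper (not part of stanscofi).

    ratings: (n_items, n_users); items: (n_features, n_items);
    users: (n_features, n_users), with the same features in the same order.
    """
    # Assumption: only pairs with a known (non-zero) rating become rows.
    item_idx, user_idx = np.nonzero(ratings)
    # Apply the operator coefficient by coefficient to each pair of vectors.
    X = OPERATORS[op](items[:, item_idx].T, users[:, user_idx].T)
    y = ratings[item_idx, user_idx]
    return X, y

ratings = np.array([[1, 0], [0, -1]])
items = np.array([[1.0, 2.0], [3.0, 4.0]])      # 2 features, 2 items
users = np.array([[10.0, 20.0], [30.0, 40.0]])  # 2 features, 2 users
X_mul, y = combine_pair_features(ratings, items, users, op="*")
X_add, _ = combine_pair_features(ratings, items, users, op="+")
```

Unlike concatenation, this keeps the number of columns equal to n_features, which is why the item and user feature sets must match.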