CIMtools.model_selection package

class CIMtools.model_selection.LeaveOneGroupOut

Bases: sklearn.model_selection._split.BaseCrossValidator

Leave-One-Group-Out cross-validator

Provides train/test indices to split data into train/test sets. Each group of reactions sharing the same condition is used once as a test set, while the remaining reactions form the training set. The test set includes only reactions whose transformations also appear in other reactions.
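The splitting rule described above can be sketched in plain Python. This is a minimal illustration, not the CIMtools implementation: it assumes transformations and conditions are given as simple hashable labels (the real class works on reaction containers), and the function name `leave_one_condition_out` is hypothetical. Skipping a condition that has no eligible test reactions is also an assumption.

```python
from collections import defaultdict


def leave_one_condition_out(transformations, conditions):
    """Yield (train, test) index lists, one pair per condition group.

    Assumption: both arguments are sequences of plain hashable labels that
    stand in for reaction containers and group labels.
    """
    # Record the conditions under which each transformation was observed.
    seen_under = defaultdict(set)
    for t, c in zip(transformations, conditions):
        seen_under[t].add(c)

    # Group sample indices by condition.
    by_condition = defaultdict(list)
    for i, c in enumerate(conditions):
        by_condition[c].append(i)

    for held_out, members in by_condition.items():
        # Keep only test reactions whose transformation also occurs elsewhere.
        test = [i for i in members if len(seen_under[transformations[i]]) > 1]
        if not test:
            continue  # assumption: a condition with no eligible test set is skipped
        train = [i for c, idx in by_condition.items() if c != held_out for i in idx]
        yield train, test


# Toy labels, purely for illustration.
transformations = ["A", "A", "B", "B", "C"]
conditions = ["water", "ethanol", "water", "ethanol", "water"]
splits = list(leave_one_condition_out(transformations, conditions))
```

In this toy data, transformation "C" occurs only in water, so its reaction never enters a test set but still contributes to training when ethanol is held out.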

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters
  • X – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.

Returns

n_splits – Returns the number of splitting iterations in the cross-validator.

Return type

int

split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters
  • X (array-like, of length n_samples) – Training data; includes the reaction containers.

  • y (array-like, of length n_samples) – The target variable for supervised learning problems.

  • groups – Group labels for the samples used while splitting the dataset into train/test set.

Yields
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

class CIMtools.model_selection.TransformationOut(n_splits=5, n_repeats=1, shuffle=False, random_state=None)

Bases: sklearn.model_selection._split.BaseCrossValidator

Transformation-out cross-validator

Provides train/test indices to split data into train/test sets. The dataset is split into k consecutive folds (without shuffling by default). Each fold includes all reactions of each transformation assigned to it, so a transformation is never divided between folds. Each fold is then used once as a validation (test) set while the k - 1 remaining folds form the training set. The test set includes only reactions with conditions that appeared in other reactions. The procedure is repeated n_repeats times, with a different randomization in each repetition.

Parameters
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • n_repeats (int, default=1) – Number of times cross-validator needs to be repeated.

  • shuffle (boolean, optional, default=False) – Whether to shuffle the data before splitting into batches.

  • random_state (int, RandomState instance or None, optional, default=None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.
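The fold construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CIMtools implementation: labels are plain hashable values rather than reaction containers, folds are filled greedily (largest transformation first, into the currently smallest fold), and shuffling and repetition are left out. The function name `transformation_out` is hypothetical.

```python
from collections import defaultdict


def transformation_out(transformations, conditions, n_splits=2):
    """Yield (train, test) index lists for a simplified transformation-out split.

    Assumptions: plain hashable labels, greedy fold filling, no shuffling,
    no repetition.
    """
    # Record which transformations occur under each condition.
    seen_with = defaultdict(set)
    for t, c in zip(transformations, conditions):
        seen_with[c].add(t)

    # Group sample indices by transformation so a group is never divided.
    by_transformation = defaultdict(list)
    for i, t in enumerate(transformations):
        by_transformation[t].append(i)
    if n_splits > len(by_transformation):
        raise ValueError("n_splits cannot exceed the number of transformations")

    # Greedy assignment: each whole group goes to the currently smallest fold.
    folds = [[] for _ in range(n_splits)]
    for members in sorted(by_transformation.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    for k, fold in enumerate(folds):
        # Test reactions must have a condition that another transformation shares.
        test = sorted(i for i in fold if len(seen_with[conditions[i]]) > 1)
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


# Toy labels, purely for illustration.
transformations = ["A", "A", "B", "B", "C"]
conditions = ["water", "ethanol", "water", "acetone", "ethanol"]
result = list(transformation_out(transformations, conditions, n_splits=2))
```

Here the reaction run in acetone is excluded from its test fold because no other transformation was measured under that condition, yet it remains available for training.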

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters
  • X – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.

  • groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.

Returns

n_splits – Returns the number of splitting iterations in the cross-validator.

Return type

int

split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters
  • X (array-like, of length n_samples) – Training data; includes the reaction containers.

  • y (array-like, of length n_samples) – The target variable for supervised learning problems.

  • groups – Group labels for the samples used while splitting the dataset into train/test set.

Yields
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

CIMtools.model_selection.rtc_env_selection(X, y, data, envs, reg_model, score)

Finds the best number of neighbours for the ReactionTypeControl method.

The hyperparameters of the applicability domain (AD) model are selected by internal cross-validation on the training set, with RMSE_AD or BA_AD used as the scoring function to be maximized.

Parameters
  • X – array-like or sparse matrix, shape (n_samples, n_features). The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.

  • y – array-like, shape = [n_samples] or [n_samples, n_outputs]. The target values (real numbers in regression).

  • data – Reaction data obtained after reading an RDF file.

  • envs – list or tuple. Candidate numbers of neighbours.

  • reg_model – Regression estimator.

  • score – Scoring function to use: ‘ba_ad’ or ‘rmse_ad’.

Returns

The selected number of neighbours.

Return type

int
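The selection step can be pictured as a small search over the candidate values in envs. The sketch below is a generic illustration, not the body of rtc_env_selection: `select_best_env` and `cv_scores_for` are hypothetical names, and the per-fold scores are made-up numbers standing in for whatever BA_AD or RMSE_AD values the cross-validation would produce.

```python
from statistics import mean


def select_best_env(envs, cv_scores_for):
    """Return the candidate with the highest mean cross-validation score.

    Assumption: cv_scores_for(env) is a hypothetical callable that returns
    the per-fold scores (higher is better) obtained with `env` neighbours.
    """
    if not envs:
        raise ValueError("envs must contain at least one candidate")
    best_env, best_score = None, float("-inf")
    for env in envs:
        score = mean(cv_scores_for(env))
        if score > best_score:  # on a tie, the earlier candidate is kept
            best_env, best_score = env, score
    return best_env


# Made-up per-fold scores, purely for illustration.
toy_scores = {1: [0.60, 0.70], 2: [0.80, 0.90], 3: [0.75, 0.80]}
best = select_best_env((1, 2, 3), toy_scores.__getitem__)
```

Keeping the scoring callable separate from the search loop means the same loop would serve either scoring option, provided the score is expressed so that larger values are better.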