CIMtools.model_selection package¶
- class CIMtools.model_selection.LeaveOneGroupOut¶
Bases:
sklearn.model_selection._split.BaseCrossValidator
Leave-One-Group-Out cross-validator
Provides train/test indices to split data into train/test sets. All reactions with the same condition are used once as the test set (a singleton group) while the remaining reactions form the training set. The test set includes only reactions whose transformations also appear in other reactions.
- get_n_splits(X=None, y=None, groups=None)¶
Returns the number of splitting iterations in the cross-validator.
- Parameters
X (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.
- Returns
n_splits – Returns the number of splitting iterations in the cross-validator.
- Return type
int
- split(X, y=None, groups=None)¶
Generate indices to split data into training and test set.
- Parameters
X (array-like, of length n_samples) – Training data, includes reaction containers.
y (array-like, of length n_samples) – The target variable for supervised learning problems.
groups – Group labels for the samples used while splitting the dataset into train/test set.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
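The splitting logic above can be sketched in a few lines. This is a minimal illustration of the leave-one-group-out semantics only, assuming each sample carries a condition label; the additional filter applied by CIMtools.model_selection.LeaveOneGroupOut (keeping in the test set only reactions whose transformations appear elsewhere) is omitted for brevity.

```python
def leave_one_group_out(groups):
    """Yield (train_indices, test_indices), one split per unique group label."""
    unique = sorted(set(groups))
    for g in unique:
        # Test fold: all samples sharing the held-out condition.
        test = [i for i, grp in enumerate(groups) if grp == g]
        # Train fold: everything else.
        train = [i for i, grp in enumerate(groups) if grp != g]
        yield train, test

# Example: 5 reactions run under 3 distinct conditions.
groups = ['25C', '25C', '40C', '40C', '60C']
splits = list(leave_one_group_out(groups))
# One split per condition; each test fold holds exactly one condition.
```

Each call to the real class's split(X, y, groups) follows the same pattern, yielding one (train, test) pair per unique value in groups.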
- class CIMtools.model_selection.TransformationOut(n_splits=5, n_repeats=1, shuffle=False, random_state=None)¶
Bases:
sklearn.model_selection._split.BaseCrossValidator
Transformation-out cross-validator
Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds (without shuffling by default), where every fold contains all reactions of each of its transformations, so no transformation is split between folds. Each fold is then used once as a validation (test) set while the k - 1 remaining folds form the training set. The test set includes only reactions with conditions that appeared in other reactions. The whole procedure is repeated n_repeats times with different randomization in each repetition.
- Parameters
n_splits (int, default=5) – Number of folds. Must be at least 2.
n_repeats (int, default=1) – Number of times the cross-validator needs to be repeated.
shuffle (boolean, optional) – Whether to shuffle the data before splitting into batches.
random_state (int, RandomState instance or None, optional, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.
- get_n_splits(X=None, y=None, groups=None)¶
Returns the number of splitting iterations in the cross-validator.
- Parameters
X (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
y (object) – Always ignored, exists for compatibility. np.zeros(n_samples) may be used as a placeholder.
groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.
- Returns
n_splits – Returns the number of splitting iterations in the cross-validator.
- Return type
int
- split(X, y=None, groups=None)¶
Generate indices to split data into training and test set.
- Parameters
X (array-like, of length n_samples) – Training data, includes reaction containers.
y (array-like, of length n_samples) – The target variable for supervised learning problems.
groups – Group labels for the samples used while splitting the dataset into train/test set.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
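The fold construction described above can be sketched as follows. This is a simplified illustration only, assuming transformations are identified by a label per sample and dealt into folds round-robin; the real TransformationOut also filters the test set by condition and supports n_repeats repetitions, both omitted here.

```python
import random

def transformation_out(transformations, n_splits=5, shuffle=False, random_state=None):
    """Yield (train, test) index pairs; each fold holds complete transformations."""
    unique = sorted(set(transformations))
    if shuffle:
        # Deterministic shuffling when a seed is given, mirroring random_state.
        random.Random(random_state).shuffle(unique)
    # Deal whole transformations into n_splits folds round-robin, so a
    # transformation never appears in both train and test of one split.
    folds = [unique[i::n_splits] for i in range(n_splits)]
    for fold in folds:
        held_out = set(fold)
        test = [i for i, t in enumerate(transformations) if t in held_out]
        train = [i for i, t in enumerate(transformations) if t not in held_out]
        yield train, test

# Example: 6 reactions belonging to 4 transformations, split into 2 folds.
transformations = ['A', 'A', 'B', 'C', 'C', 'D']
splits = list(transformation_out(transformations, n_splits=2))
```

Note that every reaction of a given transformation lands in the same fold, which is the property that distinguishes this splitter from a plain KFold.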
- CIMtools.model_selection.rtc_env_selection(X, y, data, envs, reg_model, score)¶
Function for finding the best number of neighbours (environment radius) in the ReactionTypeControl method.
All hyperparameters of the AD model are selected by internal cross-validation on the training set. The hyperparameters of the AD definition approach are optimized in cross-validation, with the RMSE_AD or BA_AD metric used as the scoring function to be maximized.
- Parameters
X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32, and if a sparse matrix is provided, to a sparse csr_matrix.
y (array-like, shape = [n_samples] or [n_samples, n_outputs]) – The target values (real numbers in regression).
data – reactions from the parsed RDF file.
envs (list or tuple) – Numbers of neighbours to evaluate.
reg_model – estimator.
score – ‘ba_ad’ or ‘rmse_ad’.
- Returns
int
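Conceptually, rtc_env_selection loops over the candidate environment radii, scores each with internal cross-validation, and returns the radius with the best score. The sketch below illustrates that selection loop only; select_best_env and the toy score table are hypothetical stand-ins, not part of the CIMtools API, and the real function computes its scores ('ba_ad' or 'rmse_ad') from the ReactionTypeControl AD and the supplied estimator.

```python
def select_best_env(envs, score_fn):
    """Return the candidate env with the highest cross-validated score."""
    scores = {env: score_fn(env) for env in envs}
    # Both BA_AD and RMSE_AD are framed as scores to maximize.
    return max(scores, key=scores.get)

# Toy score table standing in for the cross-validated BA_AD of each radius.
toy_scores = {0: 0.61, 1: 0.74, 2: 0.69}
best = select_best_env([0, 1, 2], toy_scores.get)
# best == 1
```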