CIMtools.applicability_domain package

class CIMtools.applicability_domain.Box

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

This approach defines the AD as a bounding box: an N-dimensional hyper-rectangle (often loosely called a hypercube) whose edges are the minimum and maximum values of each descriptor used to build the model. A test compound that falls outside this box is outside the AD of the model. The method has no internal parameters and no threshold to tune.
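The bounding-box check can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the code of the Box class; the helper names fit_box and in_box are hypothetical.

```python
import numpy as np

def fit_box(X_train):
    """Return per-feature minima and maxima of the training descriptors."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.min(axis=0), X_train.max(axis=0)

def in_box(X, lower, upper):
    """True where every descriptor of a sample lies within [lower, upper]."""
    X = np.asarray(X, dtype=float)
    return np.all((X >= lower) & (X <= upper), axis=1)

# Toy data: three training compounds described by two descriptors.
X_train = np.array([[0.0, 1.0], [2.0, 3.0], [1.0, 5.0]])
lower, upper = fit_box(X_train)

# The second test compound exceeds the maximum of the first descriptor.
X_test = np.array([[1.0, 2.0], [3.0, 2.0], [0.5, 5.0]])
flags = in_box(X_test, lower, upper)
```

A single descriptor outside its training range is enough to place a compound outside the box.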

fit(X, y=None)

Find min and max values of every feature.

Parameters
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The training input samples.

  • y (Ignored) – not used, present for API consistency by convention.

Returns

self

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predict if a particular sample is an outlier or not.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

is_inlier – For each observation, tells whether (True) or not (False) it should be considered an inlier according to the fitted model.

Return type

array, shape (n_samples,)

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

class CIMtools.applicability_domain.GPR_AD(threshold='cv', score='ba_ad', gpr_model=None)

Bases: sklearn.base.BaseEstimator

Gaussian Process Regression (GPR) assumes that the joint distribution of a real-valued property of chemical reactions and their descriptors is multivariate normal (Gaussian), with the elements of its covariance matrix computed by special covariance functions (kernels). For every reaction, a GPR model uses Bayes’ theorem to produce a posterior conditional distribution (the so-called prediction density) of the reaction property given the vector of reaction descriptors. The prediction density is normal (Gaussian), with its mean corresponding to the predicted value of the property and its variance corresponding to the prediction confidence [1]. If the variance is greater than a predefined threshold σ*, the chemical reaction is considered an X-outlier (out of AD).
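The variance test can be illustrated with a hand-written zero-mean Gaussian process in NumPy. This is a sketch under assumptions: the RBF kernel, the noise level, the length scale and the threshold value sigma_star are all chosen for illustration, and the GPR_AD class may build its model differently.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def posterior_variance(X_train, X_query, noise=1e-6, length_scale=1.0):
    """Predictive variance of a zero-mean GP at each query point."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_query, X_train, length_scale)
    # var(x*) = k(x*, x*) - k*^T K^{-1} k*, and k(x*, x*) = 1 for this kernel
    solved = np.linalg.solve(K, K_star.T)
    return 1.0 - np.einsum("ij,ji->i", K_star, solved)

X_train = np.array([[0.0], [1.0], [2.0]])
X_query = np.array([[1.0], [10.0]])   # one training point, one remote point
variance = posterior_variance(X_train, X_query)

sigma_star = 0.5                      # hypothetical variance threshold
in_ad = variance <= sigma_star        # variance above the threshold means X-outlier
```

With fixed kernel hyperparameters the predictive variance depends only on the descriptors, which is why it can serve as a measure of distance from the training data.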

fit(X, y=None)

Build the model and search for the threshold. During training, a model is built and a variance threshold σ* is found, which decides whether an object belongs to the applicability domain of the model.

Parameters
  • X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs]) – The target values (real numbers in regression).

Returns

self

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predict inside or outside AD for X.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

ad – Array containing True (reaction in AD) or False (reaction outside AD) for each sample.

Return type

array of shape = [n_samples]

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

class CIMtools.applicability_domain.Leverage(threshold='auto', score='ba_ad', reg_model=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Distance-based method. The model space can be represented by a two-dimensional matrix comprising n chemicals (rows) and k variables (columns), called the descriptor matrix (X). The leverage of a chemical provides a measure of the distance of the chemical from the centroid of X. Chemicals close to the centroid are less influential in model building than extreme points. The leverages of all chemicals in the data set are generated by manipulating X according to Equation 1, which gives the so-called Influence Matrix or Hat Matrix (H).

H = X (X^{T} X)^{-1} X^{T}    (Equation 1)

where X is the descriptor matrix, X^{T} is the transpose of X, and A^{-1} is the inverse of matrix A, where A = X^{T} X.

The leverages or hat values (h_{ii}) of the chemicals (i) in the descriptor space are the diagonal elements of H and can be computed by Equation 2.

h_{ii} = x_{i}^{T} (X^{T} X)^{-1} x_{i}    (Equation 2)

where x_{i} is the descriptor vector of the query chemical (a row of X, taken as a column vector in this expression). A “warning leverage” (h*) is generally fixed at 3p/n, where n is the number of training chemicals and p is the number of model variables plus one.

Alternatively, the warning leverage can be found by internal cross-validation.

A chemical with high leverage in the training set greatly influences the regression line: the fitted regression line is forced near to the observed value and its residual (observed-predicted value) is small, so the chemical does not appear to be an outlier, even though it may actually be outside the AD. In contrast, if a chemical in the test set has a hat value greater than the warning leverage h*, this means that the prediction is the result of substantial extrapolation and therefore may not be reliable.
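The hat values and the 3p/n warning leverage can be computed directly with NumPy. The sketch below follows the two equations as written (no intercept column is added) and uses a small grid of descriptors centred on the origin; the function name leverages is hypothetical and the Leverage class may differ in detail.

```python
import numpy as np

def leverages(X_train, X_query):
    """Hat value h = x^T (X^T X)^{-1} x for each query vector x (stored as a row)."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

# Nine training chemicals on a 3 x 3 grid, so X^T X = diag(6, 6).
X_train = np.array([[x, y] for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])
n, k = X_train.shape
h_star = 3 * (k + 1) / n              # warning leverage 3p/n with p = k + 1

# One query inside the training cloud and one far away from it.
X_query = np.array([[0.5, 0.5], [3.0, 3.0]])
h = leverages(X_train, X_query)
in_ad = h <= h_star
```

For this grid the hat value reduces to (x² + y²)/6, so the remote query has h = 3, well above the warning leverage of 1.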

fit(X, y=None)

Fitting consists of computing the inverse matrix (X^{T} X)^{-1} and calculating the threshold. All hyperparameters of the AD model are selected by internal cross-validation on the training set, with the RMSE_AD or BA_AD metric used as the scoring function to maximize.

Parameters
  • X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs]) – The target values (real numbers in regression).

Returns

self

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predict inside or outside AD for X.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

ad – Array containing True (reaction in AD) or False (reaction outside AD) for each sample.

Return type

array of shape = [n_samples]

predict_proba(X)

Compute the leverage of each sample in X, that is, its distance to the center of the training set.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

leverages – The distances of the objects to the center of the training set.

Return type

array of shape = [n_samples]

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

class CIMtools.applicability_domain.ReactionTypeControl(env=0)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Reaction Type Control (RTC) is performed using reaction signature.

The signature includes both the reaction centre itself and its nearest environment up to the radius given by env. Because the term “reaction signature” is not strictly defined, the size of the environment is treated as a hyperparameter, so the method has one internal parameter. If the environment is 0, the reaction signature considers only the atoms at which the change occurs. If the environment is 1, the first-sphere neighbours are included in the reaction signature; if it is 2, the second-sphere neighbours are included as well, and so on up to the whole reaction (env=’all’). In addition, by default, every atom is labeled with its hybridization. A reaction is considered to belong to the AD of the model if its reaction signature coincides with one of those found in the training set.
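The decision rule itself is a set-membership test. The sketch below assumes the signatures have already been generated and are available as strings; the strings shown are placeholders rather than the real signature format, and the helper names are hypothetical.

```python
def fit_signatures(train_signatures):
    """Memorize the unique reaction signatures seen during training."""
    return set(train_signatures)

def in_domain(query_signatures, known_signatures):
    """True for every query whose signature was seen in the training set."""
    return [signature in known_signatures for signature in query_signatures]

# Placeholder signature strings; duplicates collapse into one entry.
known = fit_signatures(["C-O>>C=O", "C-Cl>>C-N", "C-O>>C=O"])
flags = in_domain(["C-Cl>>C-N", "C-Br>>C-N"], known)
```

A larger env makes each signature more specific, so fewer test reactions find an exact match in the training set.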

fit(X)

Fit structure-based AD. During training, the model memorizes the set of unique reaction signatures.

Parameters

X (reactions as read from an RDF file) –

Returns

self

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Reaction is considered belonging to model’s AD if its reaction signature coincides with ones used in training set.

Parameters

X (reactions as read from an RDF file) –

Returns

a – Array containing True (reaction in AD) or False (reaction outside AD).

Return type

array

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

class CIMtools.applicability_domain.SimilarityDistance(leaf_size=40, metric='minkowski', score='ba_ad', threshold='auto', reg_model=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Distance-based method for defining applicability domain (AD).

In the non-linear kNN QSPR method, models are based on chemical similarity calculations, so a large similarity distance can signal that a query compound is too dissimilar to the training set compounds. This approach provides a similarity measure for a new chemical with respect to the compounds within the training space. The similarity is assessed from the distance of a query chemical to the nearest training compound, or from its distances to the k nearest neighbors in the training set. If the calculated distances of test set compounds are not within the user-defined threshold set by the training set molecules, the predictions for these compounds are considered unreliable. The threshold is commonly calculated as Dc = Zσ + <y>, where <y> is the average and σ is the standard deviation of the Euclidean distances of the k nearest neighbors of each compound in the training set, and Z is an empirical parameter that controls the significance level, with a default value of 0.5.

A drawback of the method is that the lack of strict rules in the literature for defining the thresholds can lead to ambiguous results. We propose a variation for finding the threshold: it is optimized in an internal cross-validation procedure by maximizing our metric.

NB: the distance is calculated to the first nearest neighbor only.
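The Dc = Zσ + <y> rule for the first nearest neighbor can be sketched with a brute-force distance search in NumPy. This is an illustration only: the class is documented as using a tree (see leaf_size and metric below), and the helper name nearest_distances is hypothetical.

```python
import numpy as np

def nearest_distances(X_ref, X_query, skip_self=False):
    """Euclidean distance from each query row to its nearest reference row."""
    diffs = X_query[:, None, :] - X_ref[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if skip_self:
        # When the queries are the reference set itself, ignore zero self-distances.
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)

# Four training compounds on the corners of a unit square.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_train = nearest_distances(X_train, X_train, skip_self=True)

Z = 0.5                                          # empirical parameter from the text
threshold = Z * d_train.std() + d_train.mean()   # Dc = Z*sigma + <y>

X_test = np.array([[0.5, 0.5], [5.0, 5.0]])
d_test = nearest_distances(X_train, X_test)
in_ad = d_test <= threshold
```

Every training compound here has its nearest neighbor at distance 1, so σ is 0 and the threshold equals the mean distance of 1.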

Parameters
  • leaf_size (positive integer (default = 40)) – Number of points at which to switch to brute-force. Changing leaf_size will not affect the results of a query, but can significantly impact the speed of a query and the memory required to store the constructed tree. The amount of memory needed to store the tree scales as approximately n_samples / leaf_size. For a specified leaf_size, a leaf node is guaranteed to satisfy leaf_size <= n_points <= 2 * leaf_size, except in the case that n_samples < leaf_size.

  • metric (string or DistanceMetric object) –

    The distance metric to use for the tree. Default=’minkowski’ with p=2 (that is, a euclidean metric). See the documentation of the DistanceMetric class for a list of available metrics. ball_tree.valid_metrics gives a list of the metrics which are valid for BallTree.

  • threshold (string or float) –

    The value against which the distance values are compared. If the calculated distances of test set compounds are not within the threshold set by the training set molecules, the predictions for these compounds are considered unreliable.

    • If ‘auto’, the threshold is calculated as Dc = Zσ + <y>, where <y> is the average and σ is the standard deviation of the Euclidean distances of the k nearest neighbors of each compound in the training set, and Z is an empirical parameter that controls the significance level, with a default value of 0.5.

    • If ‘cv’, the threshold is optimized in an internal cross-validation procedure by maximizing our metric.

    • If a float, the threshold is set to this value.

  • score (string) –

    The metric used to find the threshold.

    • If score is ‘ba_ad’, balanced accuracy is calculated. The true inliers are the objects whose prediction error is less than 3 RMSE; the remaining objects are the true outliers.

    • If score is ‘rmse_ba’, the root mean squared error of the model with AD is calculated. Sahigara et al. [1] proposed using the difference between the root mean squared errors of outliers and inliers (RMSE_AD), which shows what is predicted better: objects outside AD or objects inside and outside AD. The metric characterizes how accurate the model becomes. By inliers, we mean objects inside AD, and by outliers, objects outside AD.

  • reg_model (None or estimator) – The regression model used when searching for the threshold.

[1] Sahigara F., Mansouri K., Ballabio D., Mauri A., Consonni V., Todeschini R. Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules, 2012, vol. 17, pp. 4791-4810. doi:10.3390/molecules17054791.
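One possible reading of the ‘ba_ad’ score is sketched below: objects are labeled as true inliers when their absolute prediction error does not exceed 3 RMSE, and the AD flags are then scored by balanced accuracy. The function name ba_ad and the toy data are assumptions, the errors would normally come from cross-validation, and both classes are assumed to be present.

```python
import numpy as np

def ba_ad(y_true, y_pred, in_ad):
    """Balanced accuracy of AD flags against error-based inlier labels."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    rmse = np.sqrt(np.mean(errors ** 2))
    true_inlier = errors <= 3 * rmse
    in_ad = np.asarray(in_ad, dtype=bool)
    sensitivity = np.mean(in_ad[true_inlier])     # true inliers kept inside AD
    specificity = np.mean(~in_ad[~true_inlier])   # true outliers left outside AD
    return 0.5 * (sensitivity + specificity)

# Nineteen well-predicted objects and one with a large error.
y_true = np.zeros(20)
y_pred = np.full(20, 0.1)
y_pred[-1] = 5.0

# The AD rejects the badly predicted object and, wrongly, one good object.
in_ad = np.ones(20, dtype=bool)
in_ad[[0, -1]] = False

score = ba_ad(y_true, y_pred, in_ad)
```

Balanced accuracy weights inliers and outliers equally, which matters because objects with errors above 3 RMSE are necessarily rare.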

fit(X, y=None)

Fit distance-based AD. All hyperparameters of the AD model are selected by internal cross-validation on the training set, with the RMSE_AD or BA_AD metric used as the scoring function to maximize.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency.

Returns

self – Returns self.

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predict if a particular sample is an outlier or not.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

y – For each observation, tells whether (True) or not (False) it should be considered an inlier according to the fitted model.

Return type

array, shape (n_samples,)

predict_proba(X)

Return the distance to the nearest neighbor in the training set.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

y

Return type

array, shape (n_samples,)

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

class CIMtools.applicability_domain.TwoClassClassifiers(threshold='cv', score='ba_ad', reg_model=None, clf_model=None)

Bases: sklearn.base.BaseEstimator

The model learns to distinguish inliers from outliers. Objects with a high prediction error in cross-validation (more than 3×RMSE) are considered outliers, while the rest are inliers. A two-class classifier is trained to distinguish them and provides a confidence value that an object belongs to the inliers; this value is used as a measure of whether the object is in the AD. Here the Random Forest Classifier implemented in the scikit-learn library is used. The method requires fitting two hyperparameters: max_features and the probability threshold P*. If the predicted probability that an object belongs to the inliers is greater than P*, its prediction is considered reliable (within AD). The other hyperparameters of the Random Forest Classifier are left at their defaults, except the number of decision trees, which is set to 500.
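The search for P* can be sketched as a scan over candidate thresholds. The sketch assumes that the inlier labels and the cross-validated inlier probabilities are already available; the candidate grid, the toy values and the function names are hypothetical, and balanced accuracy is used as the selection metric.

```python
import numpy as np

def balanced_accuracy(is_inlier, predicted_inlier):
    """Mean of the recall on inliers and the recall on outliers."""
    sensitivity = np.mean(predicted_inlier[is_inlier])
    specificity = np.mean(~predicted_inlier[~is_inlier])
    return 0.5 * (sensitivity + specificity)

def choose_threshold(is_inlier, p_inlier, candidates):
    """Pick the candidate P* with the highest balanced accuracy."""
    scores = [balanced_accuracy(is_inlier, p_inlier > p) for p in candidates]
    return candidates[int(np.argmax(scores))]

# Labels from the 3 x RMSE rule and classifier probabilities (toy values).
is_inlier = np.array([True, True, True, True, False, False])
p_inlier = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.65])

candidates = [0.3, 0.5, 0.66, 0.75]
p_star = choose_threshold(is_inlier, p_inlier, candidates)
in_ad = p_inlier > p_star   # probability above P* means within AD
```

In this toy case the threshold 0.66 rejects both outliers at the cost of one inlier, which gives the best trade-off among the four candidates.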

fit(X, y=None)

Build the model and search for the threshold. During training, a model is built and a probability threshold is found, which decides whether an object belongs to the applicability domain of the model. For this purpose two models are used: reg_model is the regression model and clf_model is the classification model.

Parameters
  • X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Use dtype=np.float32 for maximum efficiency.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs]) – The target values (real numbers in regression).

Returns

self

Return type

object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predict inside or outside AD for X.

Parameters

X (array-like or sparse matrix, shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

Returns

ad – Array containing True (reaction in AD) or False (reaction outside AD) for each sample.

Return type

array of shape = [n_samples]

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance