CIMtools.datasets package

CIMtools.datasets.load_da(*, return_X_y=False, as_frame=False)

Load and return the Diels-Alder (DA) reactions dataset (regression).

Samples total

1866

Data

reactions, type: ReactionContainer

Targets

real logK (-8.511) - 8.568, type: float

Read more in the article: Madzhidov, T.I.; Gimadiev, T.R.; Malakhova, D.A.; Nugmanov, R.I.; Baskin, I.I.; Antipin, I.S.; Varnek, A. Structure-Reactivity modelling for Diels-Alder reactions based on the condensed reaction graph approach. J. Struct. Chem. 2017, 58, 685–691, doi:10.1134/S0022476617040023.

Parameters
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.

Returns

  • data (Bunch) – Dictionary-like object, with the following attributes.

    data: ndarray of shape (1866, )

    The data array.

    target: ndarray of shape (1866, )

    The regression target (logarithm of the reaction rate constant).

    feature_names: list

    The name of the dataset ([‘DA reactions’]).

    target_names: list

    The name of target ([‘logK’]).

    frame: DataFrame of shape (1866, 2)

    Only present when as_frame=True. DataFrame with data and target.

  • (data, target) (tuple if return_X_y is True)
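A minimal sketch of the return_X_y=True calling convention described above. The helper names target_summary and summarize_da are illustrative and not part of CIMtools; the sketch assumes only that the loader returns a (data, target) pair of equal-length sequences, as the documentation states.

```python
def target_summary(loader):
    """Call a dataset loader with return_X_y=True and summarise its target."""
    data, target = loader(return_X_y=True)
    values = [float(v) for v in target]
    return {"n_samples": len(data), "min": min(values), "max": max(values)}


def summarize_da():
    """Summarise the DA dataset (requires CIMtools to be installed)."""
    # Imported lazily so the helper above can be used without CIMtools.
    from CIMtools.datasets import load_da

    return target_summary(load_da)
```

Because target_summary only relies on the shared signature, the same helper should work with load_e2 and load_sn2 as well.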

CIMtools.datasets.load_e2(*, return_X_y=False, as_frame=False)

Load and return bimolecular elimination (E2) reactions dataset (regression).

Samples total

1820

Data

reactions, type: ReactionContainer

Targets

real logK (-7.23) - 2.67, type: float

Read more in the article: Madzhidov, T.I.; Bodrov, A.V.; Gimadiev, T.R.; Nugmanov, R.I.; Antipin, I.S.; Varnek, A. Structure–reactivity relationship in bimolecular elimination reactions based on the condensed graph of a reaction. J. Struct. Chem. 2015, 56, 1227–1234, doi:10.1134/S002247661507001X.

Parameters
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.

Returns

  • data (Bunch) – Dictionary-like object, with the following attributes.

    data: ndarray of shape (1820, )

    The data array.

    target: ndarray of shape (1820, )

    The regression target (logarithm of the reaction rate constant).

    feature_names: list

    The name of the dataset ([‘E2 reactions’]).

    target_names: list

    The name of target ([‘logK’]).

    frame: DataFrame of shape (1820, 2)

    Only present when as_frame=True. DataFrame with data and target.

  • (data, target) (tuple if return_X_y is True)
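The Bunch attributes listed above can be inspected as in the following sketch. The helper names describe_bunch and describe_e2 are hypothetical; the code assumes only attribute-style access to data, target, feature_names and target_names, and treats frame as optional because it is documented as present only when as_frame=True.

```python
def describe_bunch(bunch):
    """Collect the documented Bunch attributes into a plain dict."""
    return {
        "n_samples": len(bunch.data),
        "n_targets": len(bunch.target),
        "feature_names": list(bunch.feature_names),
        "target_names": list(bunch.target_names),
        # `frame` may be missing (or None) unless as_frame=True was passed.
        "has_frame": getattr(bunch, "frame", None) is not None,
    }


def describe_e2():
    """Describe the E2 dataset (requires CIMtools to be installed)."""
    # Imported lazily so describe_bunch can be used without CIMtools.
    from CIMtools.datasets import load_e2

    return describe_bunch(load_e2())
```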

CIMtools.datasets.load_nicklaus_tautomers(*, return_X_y=False, as_frame=False, as_regression=False)

Load and return Nicklaus’s tautomers dataset (Regression and Classification).

Samples total

5960

Samples Regression

2824

Data

molecules, type: MoleculeContainer

Targets

real ratio 0.0 - 1.0, type: float (Regression)

Classes

5

Molecules have a .meta attribute, which returns a dict with additional data:

  • structure_id: row in the original file

  • tautomer_id: id of the tautomer structure within the row

  • additive.{n}: solvent name; {n} is the solvent id, starting from 1. Mixtures have additional additive keys, e.g. additive.2, additive.3 …

  • amount.{n}: amount of the additive

  • prevalence (optional): qualitative category of the tautomer reported in the publication

  • temperature (optional): in Kelvin

  • pH (optional)

For Regression: The numeric proportion of tautomer based on its quantitative ratio and qualitative prevalence.

For Classification: Quantitative ratio of tautomer compared to other tautomers.

Numeric classification of qualitative prevalence:

  • 0: Not observed

  • 1: Less favored, less stable, minor, observed

  • 2: Equally, favored, major, in equilibrium, preferred, similar spectra

  • 3: More favored, more stable, predominant, strongly favored

  • 4: Exclusively observed, only observed, only tautomer, identical tautomer

Numeric classification of quantitative amount of tautomers:

  • 0: ratio = 0.0 - 0.0099

  • 1: ratio = 0.01 - 0.30

  • 2: ratio = 0.31 - 0.69

  • 3: ratio = 0.70 - 0.99

  • 4: ratio = 1
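The quantitative binning above can be sketched as a small function. This is an illustration, not the code CIMtools uses: the name ratio_to_class is hypothetical, and because the listed ranges leave small gaps (for example between 0.30 and 0.31), the thresholds below assign such values to the next higher bin as an assumption.

```python
def ratio_to_class(ratio):
    """Map a tautomer ratio in [0, 1] to the numeric class 0-4 listed above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must lie in [0, 1], got {ratio!r}")
    if ratio < 0.01:   # 0.0 - 0.0099
        return 0
    if ratio <= 0.30:  # 0.01 - 0.30
        return 1
    if ratio < 0.70:   # 0.31 - 0.69 (assumption: also covers the 0.30-0.31 gap)
        return 2
    if ratio < 1.0:    # 0.70 - 0.99 (assumption: also covers values just below 1)
        return 3
    return 4           # ratio = 1
```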

Parameters
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.

  • as_regression (bool, default=False) – If True, returns the regression subset instead of classes.

Returns

  • data (Bunch) – Dictionary-like object, with the following attributes.

    data: ndarray of shape (n, )

    The data array.

    target: ndarray of shape (n, )

    The regression or classification target.

    feature_names: list

    The name of the dataset (‘Tautomers’).

    target_names: list

    The name of target ([‘ratio or category’]).

    frame: DataFrame of shape (n, 2)

    Only present when as_frame=True. DataFrame with data and target.

  • (data, target) (tuple if return_X_y is True)
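The additive.{n} and amount.{n} keys of the .meta dict described above can be paired up as in this sketch. The helper names additives and tautomer_additives are hypothetical, and the type of the amount values is not specified in this documentation, so they are passed through unchanged.

```python
import re

# Matches keys such as "additive.1", "additive.2", ...
_ADDITIVE_KEY = re.compile(r"^additive\.(\d+)$")


def additives(meta):
    """Return [(n, solvent_name, amount), ...] sorted by n from a .meta dict."""
    found = []
    for key, name in meta.items():
        match = _ADDITIVE_KEY.match(key)
        if match:
            n = int(match.group(1))
            # amount.{n} may be absent; None is returned in that case.
            found.append((n, name, meta.get(f"amount.{n}")))
    return sorted(found, key=lambda item: item[0])


def tautomer_additives():
    """Yield the additives of every molecule (requires CIMtools)."""
    # Imported lazily so additives() can be used without CIMtools.
    from CIMtools.datasets import load_nicklaus_tautomers

    for molecule in load_nicklaus_tautomers().data:
        yield additives(molecule.meta)
```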

CIMtools.datasets.load_sn2(*, return_X_y=False, as_frame=False)

Load and return bimolecular nucleophilic substitution (SN2) reactions dataset (regression).

Samples total

4830

Data

reactions, type: ReactionContainer

Targets

real logK (-7.68) - 1.65, type: float

Read more in the article: Gimadiev, T.; Madzhidov, T.; Tetko, I.; Nugmanov, R.; Casciuc, I.; Klimchuk, O.; Bodrov, A.; Polishchuk, P.; Antipin, I.; Varnek, A. Bimolecular Nucleophilic Substitution Reactions: Predictive Models for Rate Constants and Molecular Reaction Pairs Analysis. J. Mol. Inf. 2018, 38, 1800104, doi:10.1002/minf.201800104.

Parameters
  • return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.

  • as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.

Returns

  • data (Bunch) – Dictionary-like object, with the following attributes.

    data: ndarray of shape (4830, )

    The data array.

    target: ndarray of shape (4830, )

    The regression target (logarithm of the reaction rate constant).

    feature_names: list

    The name of the dataset (‘SN2 reactions’).

    target_names: list

    The name of target ([‘logK’]).

    frame: DataFrame of shape (4830, 2)

    Only present when as_frame=True. DataFrame with data and target.

  • (data, target) (tuple if return_X_y is True)

CIMtools.datasets.molconvert_chemaxon(data)

ChemAxon molconvert wrapper.

Parameters

data (Buffer or string or path to file) – Chemical data in any format supported by molconvert.

Returns

array – CGRtools data types for storing reactions and molecules.

Return type

Array of molecules or reactions