CIMtools.datasets package
- CIMtools.datasets.load_da(*, return_X_y=False, as_frame=False)
Load and return the Diels-Alder (DA) reactions dataset (regression).
Samples total
1866
Data
reactions, type: ReactionContainer
Targets
real logK, from -8.511 to 8.568, type: float
Read more in the article: Madzhidov, T.I.; Gimadiev, T.R.; Malakhova, D.A.; Nugmanov, R.I.; Baskin, I.I.; Antipin, I.S.; Varnek, A. Structure-Reactivity modelling for Diels-alder reactions based on the condensed REACTION graph approach. J. Struct. Chem. 2017, 58, 685–691, doi:10.1134/S0022476617040023.
- Parameters
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
- Returns
data (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray of shape (1866, )
The data array.
- target: ndarray of shape (1866, )
The regression target (logarithm of the reaction rate constant).
- feature_names: list
The name of the dataset ([‘DA reactions’]).
- target_names: list
The name of target ([‘logK’]).
- frame: DataFrame of shape (1866, 2)
Only present when as_frame=True. DataFrame with data and target.
(data, target) (tuple if return_X_y is True)
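The loaders in this package share the same calling convention, so a minimal usage sketch with load_da may help. It assumes CIMtools is installed. The helper `describe_dataset` is hypothetical (not part of CIMtools), and the fallback object holds made-up placeholder values that only show the call shape when the package is missing.

```python
from types import SimpleNamespace


def describe_dataset(bunch):
    """Return (n_samples, min_target, max_target) for a Bunch-like object.

    Hypothetical helper: it relies only on the documented .data and
    .target attributes of the returned Bunch.
    """
    targets = [float(t) for t in bunch.target]
    return len(bunch.data), min(targets), max(targets)


try:
    # Requires CIMtools; the import is guarded so the sketch still runs without it.
    from CIMtools.datasets import load_da

    print(describe_dataset(load_da()))
except ImportError:
    # Placeholder stand-in with invented values, NOT real dataset content.
    demo = SimpleNamespace(data=["rxn-1", "rxn-2"], target=[-1.5, 0.25])
    print(describe_dataset(demo))
```

Passing return_X_y=True instead would give the (data, target) tuple directly, which can be unpacked as `X, y = load_da(return_X_y=True)`.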
- CIMtools.datasets.load_e2(*, return_X_y=False, as_frame=False)
Load and return bimolecular elimination (E2) reactions dataset (regression).
Samples total
1820
Data
reactions, type: ReactionContainer
Targets
real logK, from -7.23 to 2.67, type: float
Read more in the article: Madzhidov, T.I.; Bodrov, A.V.; Gimadiev, T.R.; Nugmanov, R.I.; Antipin, I.S.; Varnek, A. Structure–reactivity relationship in bimolecular elimination reactions based on the condensed graph of a reaction. J. Struct. Chem. 2015, 56, 1227–1234, doi:10.1134/S002247661507001X.
- Parameters
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
- Returns
data (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray of shape (1820, )
The data array.
- target: ndarray of shape (1820, )
The regression target (logarithm of the reaction rate constant).
- feature_names: list
The name of the dataset ([‘E2 reactions’]).
- target_names: list
The name of target ([‘logK’]).
- frame: DataFrame of shape (1820, 2)
Only present when as_frame=True. DataFrame with data and target.
(data, target) (tuple if return_X_y is True)
- CIMtools.datasets.load_nicklaus_tautomers(*, return_X_y=False, as_frame=False, as_regression=False)
Load and return Nicklaus’s tautomers dataset (Regression and Classification).
Samples total
5960
Samples Regression
2824
Data
molecules, type: MoleculeContainer
Targets
real ratio 0.0 - 1.0, type: float (Regression)
Classes
5
Each molecule has a .meta attribute that returns a dict with additional data:
- structure_id: row in the original file.
- tautomer_id: ID of the tautomer structure within the row.
- additive.{n}: solvent name, where {n} is the solvent index, starting from 1. Mixtures have additional additive keys, e.g. additive.2, additive.3, …
- amount.{n}: amount of the additive.
- prevalence (optional): qualitative category of the tautomer reported in the publication.
- temperature (optional): in Kelvin.
- pH (optional).
For Regression: The numeric proportion of tautomer based on its quantitative ratio and qualitative prevalence.
For Classification: Quantitative ratio of tautomer compared to other tautomers.
Numeric classification of qualitative prevalence:
- 0: Not observed
- 1: Less favored, less stable, minor, observed
- 2: Equally, favored, major, in equilibrium, preferred, similar spectra
- 3: More favored, more stable, predominant, strongly favored
- 4: Exclusively observed, only observed, only tautomer, identical tautomer
Numeric classification of quantitative amount of tautomers:
- 0: ratio = 0.0 - 0.0099
- 1: ratio = 0.01 - 0.30
- 2: ratio = 0.31 - 0.69
- 3: ratio = 0.70 - 0.99
- 4: ratio = 1
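The quantitative binning above can be sketched as a small function. This is an illustration only, not the code CIMtools uses: `ratio_to_class` is a hypothetical name, and because the listed ranges leave small gaps (for example between 0.30 and 0.31), the treatment of values that fall inside those gaps is an assumption.

```python
def ratio_to_class(ratio: float) -> int:
    """Map a tautomer ratio in [0, 1] to the class index 0-4 listed above.

    Assumption: values in the gaps between the documented ranges
    (e.g. 0.305) are assigned to the next higher class.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if ratio < 0.01:
        return 0  # documented range 0.0 - 0.0099
    if ratio <= 0.30:
        return 1  # documented range 0.01 - 0.30
    if ratio < 0.70:
        return 2  # documented range 0.31 - 0.69
    if ratio < 1.0:
        return 3  # documented range 0.70 - 0.99
    return 4  # ratio = 1


for value in (0.0, 0.2, 0.5, 0.85, 1.0):
    print(value, ratio_to_class(value))
```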
- Parameters
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
as_regression (bool, default=False) – If True, returns regression subset instead of classes
- Returns
data (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray of shape (n, )
The data array.
- target: ndarray of shape (n, )
The regression or classification target.
- feature_names: list
The name of the dataset (‘Tautomers’).
- target_names: list
The name of target ([‘ratio or category’]).
- frame: DataFrame of shape (n, 2)
Only present when as_frame=True. DataFrame with data and target.
(data, target) (tuple if return_X_y is True)
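The additive.{n} and amount.{n} keys described for the .meta attribute can be paired up with a short loop. The sketch below is a hedged illustration: `solvent_components` is a hypothetical helper, the value types stored in .meta are not stated in this documentation, and the example dict contains made-up values rather than a real record.

```python
def solvent_components(meta):
    """Collect (additive, amount) pairs from a .meta-style dict.

    Assumes the keys are numbered consecutively from 1, as described
    for additive.{n}; a missing amount.{n} is reported as None.
    """
    components = []
    n = 1
    while f"additive.{n}" in meta:
        components.append((meta[f"additive.{n}"], meta.get(f"amount.{n}")))
        n += 1
    return components


# Invented example values, only to show the key layout.
example_meta = {
    "structure_id": 1,
    "tautomer_id": 2,
    "additive.1": "water",
    "amount.1": 0.5,
    "additive.2": "ethanol",
    "amount.2": 0.5,
    "temperature": 298.15,
}

print(solvent_components(example_meta))
```

With a loaded dataset, the same helper could be applied to each molecule's .meta dict.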
- CIMtools.datasets.load_sn2(*, return_X_y=False, as_frame=False)
Load and return bimolecular nucleophilic substitution (SN2) reactions dataset (regression).
Samples total
4830
Data
reactions, type: ReactionContainer
Targets
real logK, from -7.68 to 1.65, type: float
Read more in the article: Gimadiev, T.; Madzhidov, T.; Tetko, I.; Nugmanov, R.; Casciuc, I.; Klimchuk, O.; Bodrov, A.; Polishchuk, P.; Antipin, I.; Varnek, A. Bimolecular Nucleophilic Substitution Reactions: Predictive Models for Rate Constants and Molecular Reaction Pairs Analysis. J. Mol. Inf. 2018, 38, 1800104, doi:10.1002/minf.201800104.
- Parameters
return_X_y (bool, default=False) – If True, returns (data, target) instead of a Bunch object.
as_frame (bool, default=False) – If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described below.
- Returns
data (Bunch) – Dictionary-like object, with the following attributes.
- data: ndarray of shape (4830, )
The data array.
- target: ndarray of shape (4830, )
The regression target (logarithm of the reaction rate constant).
- feature_names: list
The name of the dataset (‘SN2 reactions’).
- target_names: list
The name of target ([‘logK’]).
- frame: DataFrame of shape (4830, 2)
Only present when as_frame=True. DataFrame with data and target.
(data, target) (tuple if return_X_y is True)
- CIMtools.datasets.molconvert_chemaxon(data)
ChemAxon molconvert wrapper.
- Parameters
data (buffer, string, or path to file) – Chemical data in any format supported by molconvert.
- Returns
array – CGRtools data types for storing reactions and molecules.
- Return type
Array of molecules or reactions