AtomBondFeaturizer

BONAFIDE main module.

class bonafide.bonafide.AtomBondFeaturizer(log_file_name='bonafide.log')

Bases: _AtomBondFeaturizer

Main class of the Bond and Atom Featurizer and Descriptor Extractor (BONAFIDE).

It implements all the methods available to the user for calculating atom- and/or bond-specific features.

Parameters:
log_file_name : str, optional

The name of the log file to which all logging messages are written, by default “bonafide.log”. A file with this name must not already exist.

Attributes:
_atom_feature_indices_2D : List[int]

The list of atom feature indices that can be calculated for molecules for which only 2D information is available.

_atom_feature_indices_3D : List[int]

The list of atom feature indices that can be calculated for molecules for which 3D information is available.

_bond_feature_indices_2D : List[int]

The list of bond feature indices that can be calculated for molecules for which only 2D information is available.

_bond_feature_indices_3D : List[int]

The list of bond feature indices that can be calculated for molecules for which 3D information is available.

_feature_config : Dict[str, Any]

The configuration settings for the individual programs used for feature calculation. The default settings are loaded from the _feature_config.toml file. The current settings can be inspected with the print_options() method and changed using the set_options() method.

_feature_info : Dict[int, Dict[str, Any]]

The metadata of all implemented atom and bond features, e.g., the name of the feature, its dimensionality requirements (either 2D or 3D), or the program it is calculated with (origin). The data is loaded from the _feature_info.json file and should not be manually modified.

_feature_info_df : pd.DataFrame

A pandas DataFrame listing all implemented atom and bond features, with the feature indices as the DataFrame index and the key characteristics of each feature as columns.

_functional_groups_smarts : Dict[str, List[Tuple[str, Chem.rdchem.Mol]]]

A dictionary containing the names and SMARTS patterns of different functional groups.

_init_directory : str

The path to the directory where the AtomBondFeaturizer object was initialized.

_keep_output_files : bool

If True, all output files created during the feature calculations are kept. If False, they are removed when the calculation is done.

_loc : str

The location string representing the current class and method for logging purposes.

_namespace : Optional[str]

The namespace for the molecule as defined by the user when reading in the molecule.

_output_directory : Optional[str]

The path to the directory where all output files created during the feature calculations are stored (if requested).

_periodic_table : Dict[str, element]

A dictionary representing the periodic table with element symbols as keys and mendeleev element objects as values.

mol_vault : Optional[MolVault]

Dataclass object for storing all relevant data on the molecule for which features should be calculated.

add_custom_featurizer(custom_metadata)

Add a custom featurizer to the BONAFIDE framework.

After successfully calling this method, the custom feature is assigned its own feature index and can be used like any other built-in feature.

Parameters:
custom_metadata : Dict[str, Any]

A dictionary containing the required information on the custom featurizer. It must contain the following data:

  • name (str): The name of the custom feature.

  • origin (str): The origin program of the custom feature (e.g., “custom”).

  • feature_type (str): The type of the custom feature (either “atom” or “bond”).

  • dimensionality (str): The dimensionality of the custom feature (either “2D” or “3D”).

  • data_type (str): The data type of the custom feature specified as string (either “str”, “int”, “float”, or “bool”).

  • requires_electronic_structure_data (bool): Whether electronic structure data is required for calculating the custom feature.

  • requires_bond_data (bool): Whether bond data is required for calculating the custom feature.

  • requires_charge (bool): Whether the charge of the molecule is required for calculating the custom feature.

  • requires_multiplicity (bool): Whether the multiplicity of the molecule is required for calculating the custom feature.

  • config_path (dict): Dictionary of optional parameters passed to the custom featurizer. The keys of this dictionary will be available as attributes in the custom featurizer class.

  • factory (callable): The factory class for calculating the custom feature. It must inherit from BaseFeaturizer from bonafide/utils/base_featurizer.py.

Returns:
None
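The required keys listed above can be checked before the dictionary is handed to add_custom_featurizer(). The following is a minimal sketch of such a check, not part of BONAFIDE: the helper check_custom_metadata, the placeholder class MyAtomFeaturizer, the feature name, and the config_path entry are all hypothetical. A real factory must inherit from BaseFeaturizer, whose interface is not described in this section.

```python
# Keys listed in the documentation above, with the Python type each value should have.
REQUIRED_KEYS = {
    "name": str,
    "origin": str,
    "feature_type": str,
    "dimensionality": str,
    "data_type": str,
    "requires_electronic_structure_data": bool,
    "requires_bond_data": bool,
    "requires_charge": bool,
    "requires_multiplicity": bool,
    "config_path": dict,
}


def check_custom_metadata(metadata):
    """Return a list of problems found in a custom_metadata dictionary."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in metadata:
            problems.append(f"missing key: {key}")
        elif not isinstance(metadata[key], expected_type):
            problems.append(f"{key} must be of type {expected_type.__name__}")
    if not callable(metadata.get("factory")):
        problems.append("factory must be callable")
    if metadata.get("feature_type") not in ("atom", "bond"):
        problems.append("feature_type must be 'atom' or 'bond'")
    if metadata.get("dimensionality") not in ("2D", "3D"):
        problems.append("dimensionality must be '2D' or '3D'")
    if metadata.get("data_type") not in ("str", "int", "float", "bool"):
        problems.append("data_type must be 'str', 'int', 'float', or 'bool'")
    return problems


class MyAtomFeaturizer:
    """Hypothetical placeholder. A real factory must inherit from BaseFeaturizer."""


custom_metadata = {
    "name": "my_atom_feature",  # hypothetical feature name
    "origin": "custom",
    "feature_type": "atom",
    "dimensionality": "2D",
    "data_type": "float",
    "requires_electronic_structure_data": False,
    "requires_bond_data": False,
    "requires_charge": False,
    "requires_multiplicity": False,
    "config_path": {"scale": 1.0},  # hypothetical optional parameter
    "factory": MyAtomFeaturizer,
}

print(check_custom_metadata(custom_metadata))  # prints []
```

An empty list only means that the dictionary has the documented shape. Whether the featurizer itself works is decided by BONAFIDE when the feature is calculated.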
attach_electronic_structure(electronic_structure_data, state='n')

Attach electronic structure data files to a molecule vault hosting a 3D molecule.

The input can either be a single file path or a list of file paths. The state parameter specifies the redox state of the molecule to which the electronic structure data is attached.

Parameters:
electronic_structure_data : Union[str, List[str]]

A list of file paths to the electronic structure files or a single file path.

state : str, optional

The redox state of the electronic structure data to be attached, by default “n”. Can either be

  • “n” (actual molecule),

  • “n+1” (actual molecule plus one electron), or

  • “n-1” (actual molecule minus one electron).

Returns:
None
attach_energy(energy_data, state='n', prune_by_energy=None)

Attach molecular energy values to a molecule vault hosting a 3D molecule.

The input to energy_data can either be a single 2-tuple or a list of 2-tuples. Each 2-tuple must contain the energy value (first entry) and the respective energy unit (second entry). Supported energy units are “Eh”, “kcal/mol”, and “kJ/mol”.

The state parameter specifies the redox state of the molecule to which the energy values are attached.

If desired, the conformer ensemble can be pruned based on the attached energy values for state “n” (actual molecule) through the prune_by_energy parameter.

Parameters:
energy_data : Union[Tuple[Union[int, float], str], List[Tuple[Union[int, float], str]]]

A 2-tuple or a list of 2-tuples containing the energy values and respective units.

state : str, optional

The redox state to which the energy values are attached, by default “n”. Can either be

  • “n” (actual molecule),

  • “n+1” (actual molecule plus one electron), or

  • “n-1” (actual molecule minus one electron).

prune_by_energy : Optional[Tuple[Union[int, float], str]], optional

If a value other than None is provided, all conformers with a relative energy above this value are set to be invalid and ignored during feature calculation and any further processing. The input must be a 2-tuple in which the first entry is the relative energy cutoff value and the second entry is the respective energy unit. Supported units are “Eh”, “kcal/mol”, and “kJ/mol”. If None, no pruning is performed, by default None.
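The pruning rule described above (conformers whose energy relative to the lowest-energy conformer exceeds the cutoff become invalid) can be illustrated with a short sketch. This is not BONAFIDE's implementation: the helper prune_mask is hypothetical, and the conversion factors are standard textbook values that may differ slightly from the constants used by the library.

```python
# Conversion factors to kcal/mol (assumed standard values).
TO_KCAL_PER_MOL = {"Eh": 627.509474, "kcal/mol": 1.0, "kJ/mol": 1.0 / 4.184}


def prune_mask(energy_data, prune_by_energy):
    """Return one bool per conformer: True if the conformer stays valid."""
    energies = [value * TO_KCAL_PER_MOL[unit] for value, unit in energy_data]
    cutoff_value, cutoff_unit = prune_by_energy
    cutoff = cutoff_value * TO_KCAL_PER_MOL[cutoff_unit]
    lowest = min(energies)
    # A conformer is pruned when its relative energy is above the cutoff.
    return [(energy - lowest) <= cutoff for energy in energies]


# Three conformers given as (value, unit) 2-tuples, as expected by energy_data.
energy_data = [(-154.1000, "Eh"), (-154.0990, "Eh"), (-154.0900, "Eh")]

# Relative energies are roughly 0.0, 0.63, and 6.28 kcal/mol.
print(prune_mask(energy_data, (3.0, "kcal/mol")))  # prints [True, True, False]
```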

attach_smiles(smiles, align=True, connectivity_method='connect_the_dots', covalent_radius_factor=1.3)

Attach a SMILES string to a molecule vault that is hosting a 3D molecule.

Before a SMILES string is attached, its compatibility with the molecule already in the vault is checked. The align parameter determines whether the initial atom order is kept (align=True) or the atom order of the SMILES string is applied (align=False).

The additional optional parameters connectivity_method and covalent_radius_factor influence how the atom connectivity of the RDKit molecule object(s) initially hosted in the molecule vault is determined (required for attaching the SMILES string).

A SMILES string can only be attached to a molecule vault for which the bonds are not determined yet. This also means that once a SMILES string is attached to a molecule vault, it cannot be changed anymore. A SMILES string cannot be attached to a molecule vault hosting a 2D molecule.

Parameters:
smiles : str

The SMILES string that should be attached to the molecule vault.

align : bool, optional

If True, the atom indices of the initially provided 3D structures are preserved. If False, the atoms are reordered according to the order in the SMILES string. By default True.

connectivity_method : str

The name of the method that is used to determine the atom connectivity. Available options are “connect_the_dots”, “van_der_waals”, and “hueckel”.

covalent_radius_factor : float

A scaling factor that is applied to the covalent radii of the atoms when determining the bonds with the van-der-Waals method.

Returns:
None
calculate_electronic_structure(engine, redox='n', prune_by_energy=None)

Calculate the electronic structure of all conformers of a molecule vault hosting a 3D molecule.

The calculation can be performed with either the Psi4 or the xtb engine. The redox parameter selects the redox states for which the electronic structure is calculated.

Parameters:
engine : str

The name of the electronic structure program to be used, either “psi4” or “xtb”.

redox : str, optional

The redox state for which the electronic structure should be calculated, by default “n”. Can be

  • “n” (only the actual molecule is calculated),

  • “n-1” (the actual molecule and its one-electron-oxidized form are calculated),

  • “n+1” (the actual molecule and its one-electron-reduced form are calculated), or

  • “all” (the actual molecule and both its one-electron-reduced and one-electron-oxidized forms are calculated).

prune_by_energy : Optional[Tuple[Union[int, float], str]], optional

If a value other than None is provided, all conformers with a relative energy above this value are set to be invalid and ignored during feature calculation and any further processing. The input must be a 2-tuple in which the first entry is the relative energy cutoff value and the second entry is the respective energy unit. Supported units are “Eh”, “kcal/mol”, and “kJ/mol”. If None, no pruning is performed, by default None.

Returns:
None
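The four redox options map to sets of states as listed above. The sketch below spells out that mapping together with the total charge each state implies (adding an electron lowers the charge by one). The names REDOX_STATES, ELECTRON_SHIFT, and states_to_calculate are hypothetical and only illustrate the bookkeeping; how BONAFIDE handles it internally is not described here.

```python
# States calculated for each value of the redox parameter, as documented above.
REDOX_STATES = {
    "n": ["n"],
    "n-1": ["n", "n-1"],
    "n+1": ["n", "n+1"],
    "all": ["n", "n+1", "n-1"],
}

# Number of electrons added relative to the actual molecule.
ELECTRON_SHIFT = {"n": 0, "n+1": 1, "n-1": -1}


def states_to_calculate(redox, charge):
    """Return (state, total charge) pairs implied by a redox option."""
    return [(state, charge - ELECTRON_SHIFT[state]) for state in REDOX_STATES[redox]]


print(states_to_calculate("all", 0))  # prints [('n', 0), ('n+1', -1), ('n-1', 1)]
```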
clear_atom_feature_cache(origin=None)

Clear the atom feature cache of the molecule vault.

This method can be used to clear previously calculated atom features from the feature cache of the molecule vault to recalculate them (e.g., after changing the configuration settings of a featurizer, see the set_options() method).

Parameters:
origin : Optional[Union[str, List[str]]]

The name or a list of the names of the program(s) of the feature(s) to be cleared (e.g., “rdkit”, “xtb”), by default None. If None, all features are cleared.

Returns:
None
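A typical reason to clear the cache is a changed configuration setting. The sketch below strings together the documented calls for that case. The helper recalculate_atom_features is hypothetical, the order of the calls is an assumption, and the featurizer argument stands for an already prepared AtomBondFeaturizer instance.

```python
def recalculate_atom_features(featurizer, option, origin, atom_indices, feature_indices):
    """Change one setting, drop cached features of one origin, and recalculate."""
    featurizer.set_options(option)                      # e.g. ("bonafide.autocorrelation.depth", 3)
    featurizer.clear_atom_feature_cache(origin=origin)  # e.g. "bonafide"
    featurizer.featurize_atoms(atom_indices, feature_indices)
    return featurizer.return_atom_features()
```

The feature indices passed in must belong to features of the cleared origin. Which indices those are can be looked up with list_atom_features().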
clear_bond_feature_cache(origin=None)

Clear the bond feature cache of the molecule vault.

This method can be used to clear previously calculated bond features from the feature cache of the molecule vault to recalculate them (e.g., after changing the configuration settings of a featurizer, see the set_options() method).

Parameters:
origin : Optional[Union[str, List[str]]]

The name or a list of the names of the program(s) of the feature(s) to be cleared (e.g., “rdkit”, “xtb”), by default None. If None, all features are cleared.

Returns:
None
determine_bonds(connectivity_method='connect_the_dots', covalent_radius_factor=1.3, allow_charged_fragments=True, embed_chiral=True)

Determine the chemical bonds of each conformer of a molecule vault hosting a 3D molecule.

This method can be used to define the chemical bonds of a molecule that was provided without information on the bonds (connectivity and bond type). Bond information is required for the calculation of certain atom and all bond features.

The optional parameters connectivity_method, covalent_radius_factor, allow_charged_fragments, and embed_chiral influence how the bonds of the individual RDKit molecule object(s) are determined.

Parameters:
connectivity_method : str

The name of the method that is used to determine the atom connectivity and bond type. Available options are “connect_the_dots”, “van_der_waals”, and “hueckel”.

covalent_radius_factor : float

A scaling factor that is applied to the covalent radii of the atoms when determining the bonds with the van-der-Waals method.

allow_charged_fragments : bool, optional

If True, fragments with a net charge are allowed when determining the bonds of the molecule, by default True.

embed_chiral : bool, optional

If True, chiral centers are embedded when determining the bonds of the molecule, by default True.

Returns:
None
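To give an intuition for the covalent_radius_factor parameter, the sketch below applies a common distance criterion: two atoms count as connected when their distance does not exceed the scaled sum of their covalent radii. This is an assumption about the general idea, not BONAFIDE's or RDKit's actual code. The function connectivity and the small radius table are hypothetical, and the sketch does not assign bond orders.

```python
import math

# Covalent radii in angstrom for a few elements (assumed literature values;
# the library may use a different table).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def connectivity(symbols, coords, covalent_radius_factor=1.3):
    """Return index pairs of atoms that are close enough to count as connected."""
    bonds = []
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            distance = math.dist(coords[i], coords[j])
            limit = covalent_radius_factor * (
                COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]]
            )
            if distance <= limit:
                bonds.append((i, j))
    return bonds


# Water: the O-H distance is about 0.96 angstrom, the H-H distance about 1.51 angstrom.
symbols = ["O", "H", "H"]
coords = [(0.0, 0.0, 0.0), (0.757, 0.586, 0.0), (-0.757, 0.586, 0.0)]

print(connectivity(symbols, coords))  # prints [(0, 1), (0, 2)]
```

A larger factor accepts longer contacts as bonds, and a smaller factor is stricter.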
featurize_atoms(atom_indices, feature_indices)

Calculate one or multiple features for selected or all atoms.

A list of all available atom features can be obtained with the list_atom_features() method. For certain features, 3D information, electronic structure data or information on the chemical bonds in the molecule is required.

Parameters:
atom_indices : Union[str, int, List[int]]

The indices of the atoms to be featurized. Can be a single index, a list of indices, or “all” to consider all atoms.

feature_indices : Union[str, int, List[int]]

The indices of the features to be calculated. Can be a single index, a list of indices, or “all” to consider all atom features.

Returns:
None
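For a molecule given as a SMILES string, the documented methods can be combined into a short workflow. The following is a sketch under assumptions: the helper featurize_smiles is hypothetical, and featurizer stands for an AtomBondFeaturizer instance created beforehand. Only method names and parameters documented on this page are used.

```python
def featurize_smiles(featurizer, smiles, namespace, feature_indices):
    """Read a SMILES string, featurize all atoms, and return a DataFrame."""
    # input_format defaults to "smiles", so it does not need to be passed.
    featurizer.read_input(smiles, namespace)
    featurizer.featurize_atoms("all", feature_indices)
    return featurizer.return_atom_features(output_format="df")
```

Because a SMILES string carries only 2D information, the selected feature indices should belong to features with 2D dimensionality.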
featurize_bonds(bond_indices, feature_indices)

Calculate one or multiple features for selected or all bonds.

A list of all available bond features can be obtained with the list_bond_features() method. For all bond features, information on the chemical bonds in the molecule is required. Some bond features further require 3D information or electronic structure data.

Parameters:
bond_indices : Union[str, int, List[int]]

The indices of the bonds to be featurized. Can be a single index, a list of indices, or “all” to consider all bonds.

feature_indices : Union[str, int, List[int]]

The indices of the features to be calculated. Can be a single index, a list of indices, or “all” to consider all bond features.

Returns:
None
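Bond features always need bond information, and some also need 3D or electronic structure data. A possible sequence of calls for a structure file is sketched below. The helper featurize_bonds_from_file is hypothetical and the order of the preparation steps is an assumption. Whether every step is needed depends on the selected features.

```python
def featurize_bonds_from_file(featurizer, file_path, namespace, feature_indices,
                              charge=0, multiplicity=1):
    """Read a 3D structure file, prepare the molecule, and featurize all bonds."""
    featurizer.read_input(file_path, namespace, input_format="file")
    featurizer.set_charge(charge)
    featurizer.set_multiplicity(multiplicity)
    featurizer.determine_bonds()                      # bond data is required for bond features
    featurizer.calculate_electronic_structure("xtb")  # only needed for some features
    featurizer.featurize_bonds("all", feature_indices)
    # reduce=True combines the values of all conformers into summary statistics.
    return featurizer.return_bond_features(reduce=True)
```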
list_atom_features(**kwargs)

Display all available atom features.

The DataFrame can be filtered with the following optional keyword arguments:

  • name

  • origin

  • dimensionality

  • data_type

  • requires_electronic_structure_data

  • requires_bond_data

  • requires_charge

  • requires_multiplicity

  • config_path

  • factory

Parameters:
**kwargs : Any

Additional optional keyword arguments for filtering the feature DataFrame. If empty, all atom features are returned.

Returns:
pd.DataFrame

A pandas DataFrame containing the selected atom features and their characteristics.
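The filter keywords can be used to collect the feature indices that are later passed to featurize_atoms(). The sketch below assumes that the index of the returned DataFrame holds the feature indices, which is stated above for _feature_info_df but not explicitly for the return value. The helper atom_feature_indices is hypothetical.

```python
def atom_feature_indices(featurizer, **filters):
    """Return the feature indices of all atom features matching the filters."""
    table = featurizer.list_atom_features(**filters)
    # Assumption: the DataFrame index holds the feature indices.
    return list(table.index)
```

A call such as atom_feature_indices(featurizer, origin="rdkit", dimensionality="2D") would then select the 2D atom features calculated with RDKit.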

list_bond_features(**kwargs)

Display all available bond features.

The DataFrame can be filtered with the following optional keyword arguments:

  • name

  • origin

  • dimensionality

  • data_type

  • requires_electronic_structure_data

  • requires_bond_data

  • requires_charge

  • requires_multiplicity

  • config_path

  • factory

Parameters:
**kwargs : Any

Additional optional keyword arguments for filtering the feature DataFrame. If empty, all bond features are returned.

Returns:
pd.DataFrame

A pandas DataFrame containing the selected bond features and their characteristics.

print_options(origin=None)

Print the configuration settings of the individual programs for feature calculation.

By providing input to the origin parameter, it can be selected which program’s settings are printed. Valid origins are:

  • alfabet

  • bonafide

  • dbstep

  • dscribe

  • kallisto

  • mendeleev

  • morfeus

  • multiwfn

  • psi4

  • qmdesc

  • rdkit

  • xtb

Parameters:
origin : Optional[Union[str, List[str]]], optional

The name(s) of the program(s) for which the configuration settings should be printed. Can either be given as string or list of multiple programs, by default None. If kept None, the settings of all programs are printed.

Returns:
None
read_input(input_value, namespace, input_format='smiles', read_energy=False, prune_by_energy=None, output_directory=None)

Read in a SMILES string, an input file (either XYZ or SDF), or an RDKit molecule object.

By default, the input_format parameter is set to “smiles”, meaning that a SMILES string can be passed to the method without specifying input_format. If a file should be read in, input_format must be set to “file”; for an RDKit molecule object, it must be set to “mol_object”.

If energies should be read from the input file or the RDKit molecule object (if available), the read_energy parameter must be set to True. This sets the energies in the molecule vault for state “n” (actual molecule). Alternatively, the attach_energy() method can be used to attach energy data to the molecule vault after reading in the molecule. That method can also attach energies for the individual redox states: “n” (actual molecule), “n+1” (one-electron-reduced molecule), and “n-1” (one-electron-oxidized molecule).

Energy data must always be specified as strings containing the value and the respective unit separated by a space, for example, "-10.5 kcal/mol" or "-1254.21548 Eh". Supported energy units are “Eh”, “kcal/mol”, and “kJ/mol”.

It is possible to prune the conformer ensemble through the prune_by_energy parameter. Pruning is done based on relative energies (of state “n”) with respect to the lowest-energy conformer in the molecule vault.

The output_directory parameter specifies where all output files created during the feature calculations are stored. If kept None, all output files are deleted.

Parameters:
input_value : Union[str, Chem.rdchem.Mol]

The path to the input file, a SMILES string, or an RDKit molecule object.

namespace : str

The namespace for the molecule that is read in. This identifier is used throughout all following BONAFIDE processes including logging.

input_format : str, optional

The type of input. Can be “smiles”, “file”, or “mol_object”, by default “smiles”.

read_energy : bool, optional

If True, an attempt is made to read energies from the input file or RDKit molecule object (if available), by default False. These energies are set for state “n” (actual molecule).

prune_by_energy : Optional[Tuple[Union[int, float], str]], optional

If a value other than None is provided, all conformers with a relative energy above this value are set to be invalid and ignored during feature calculation and any further processing. The input must be a 2-tuple in which the first entry is the relative energy cutoff value and the second entry is the respective energy unit. Supported units are “Eh”, “kcal/mol”, and “kJ/mol”. If None, no pruning is performed, by default None.

output_directory : Optional[str], optional

The path to the directory where all output files created during the feature calculations are stored. If kept None, no output files folder is created and all output files are deleted after data extraction.

Returns:
None
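The energy string format described above (value and unit separated by a space) can be split into the (value, unit) 2-tuple form used by attach_energy() and prune_by_energy. The helper parse_energy below is a hypothetical sketch and not a BONAFIDE function.

```python
SUPPORTED_UNITS = ("Eh", "kcal/mol", "kJ/mol")


def parse_energy(text):
    """Split an energy string such as '-10.5 kcal/mol' into (value, unit)."""
    value, unit = text.split()  # raises ValueError unless there are exactly two parts
    if unit not in SUPPORTED_UNITS:
        raise ValueError(f"unsupported energy unit: {unit}")
    return float(value), unit


print(parse_energy("-10.5 kcal/mol"))  # prints (-10.5, 'kcal/mol')
print(parse_energy("-1254.21548 Eh"))  # prints (-1254.21548, 'Eh')
```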
return_atom_features(atom_indices='all', output_format='df', reduce=False, temperature=298.15, ignore_invalid=True)

Return the calculated atom features after feature calculation.

The features of selected or all atoms can be returned as a pandas DataFrame, a hierarchical dictionary, or as one or multiple RDKit molecule objects with the features embedded as atom properties.

If a dictionary is requested as output format, the outer dictionary keys correspond to the atom indices. The values are dictionaries in which the keys are the feature names and the values are the respective feature values.

Parameters:
atom_indices : Union[str, int, List[int]], optional

The indices of the atoms for which features should be returned. If features are requested for atoms for which no data was calculated, the feature value will be NaN. The input to atom_indices can be a single index, a list of indices, or “all” to consider all atoms, by default “all”.

output_format : str, optional

The name of the desired output format, can be “df”, “dict”, or “mol_object”. If “df” is selected, a pandas DataFrame is returned. If “dict” is selected, the features are returned as a hierarchical dictionary. If “mol_object” is selected, one or multiple RDKit molecule objects with the features embedded as atom properties are returned, by default “df”.

reduce : bool, optional

This is only relevant for molecule vaults hosting a 3D molecule with more than one conformer. If True, the features are reduced to a single value per atom across all conformers, reporting the minimum, maximum, and mean value for each feature. In addition, if energy data is available in the molecule vault, the Boltzmann-weighted average value at the provided temperature is reported, as well as the data for the lowest- and highest-energy conformer. If False, the features are returned for each conformer separately, by default False.

temperature : Union[int, float], optional

The temperature in Kelvin at which the Boltzmann-weighted values are calculated, by default 298.15.

ignore_invalid : bool, optional

If True, the presence of any invalid conformer in the molecule vault is ignored during feature reduction. If False, the presence of any invalid conformer leads to the unreduced features being returned. Note that in both cases, invalid conformers are ignored when calculating the mean, min, and max feature values. By default True.

Returns:
Union[pd.DataFrame, Dict[int, Dict[str, Any]], List[Chem.rdchem.Mol], Chem.rdchem.Mol]

The atom features in the desired output format.
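The reduction across conformers can be illustrated for a single feature of a single atom. The sketch below computes the minimum, maximum, mean, and Boltzmann-weighted average, using weights proportional to exp(-ΔE / RT) with energies relative to the lowest-energy conformer. It is not BONAFIDE's implementation: the function reduce_feature and the result keys are hypothetical, and energies are assumed to be in kcal/mol.

```python
import math

GAS_CONSTANT = 0.0019872043  # kcal/(mol K)


def reduce_feature(values, energies_kcal, temperature=298.15):
    """Reduce per-conformer feature values to summary statistics."""
    lowest = min(energies_kcal)
    weights = [
        math.exp(-(energy - lowest) / (GAS_CONSTANT * temperature))
        for energy in energies_kcal
    ]
    total = sum(weights)
    boltzmann = sum(w * v for w, v in zip(weights, values)) / total
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "boltzmann": boltzmann,
    }


# Two conformers: the second lies 1 kcal/mol above the first, so the
# Boltzmann-weighted value is pulled towards the first conformer's value.
result = reduce_feature([1.0, 3.0], [0.0, 1.0])
print(result["mean"])              # prints 2.0
print(result["boltzmann"] < 2.0)   # prints True
```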

return_bond_features(bond_indices='all', output_format='df', reduce=False, temperature=298.15, ignore_invalid=True)

Return the calculated bond features after feature calculation.

The features of selected or all bonds can be returned as a pandas DataFrame, a hierarchical dictionary, or as one or multiple RDKit molecule objects with the features embedded as bond properties.

If a dictionary is requested as output format, the outer dictionary keys correspond to the bond indices. The values are dictionaries in which the keys are the feature names and the values are the respective feature values.

Parameters:
bond_indices : Union[str, int, List[int]], optional

The indices of the bonds for which features should be returned. If features are requested for bonds for which no data was calculated, the feature value will be NaN. The input to bond_indices can be a single index, a list of indices, or “all” to consider all bonds, by default “all”.

output_format : str, optional

The name of the desired output format, can be “df”, “dict”, or “mol_object”. If “df” is selected, a pandas DataFrame is returned. If “dict” is selected, the features are returned as a hierarchical dictionary. If “mol_object” is selected, one or multiple RDKit molecule objects with the features embedded as bond properties are returned, by default “df”.

reduce : bool, optional

This is only relevant for molecule vaults hosting a 3D molecule with more than one conformer. If True, the features are reduced to a single value per bond across all conformers, reporting the minimum, maximum, and mean value for each feature. In addition, if energy data is available in the molecule vault, the Boltzmann-weighted average value at the provided temperature is reported, as well as the data for the lowest- and highest-energy conformer. If False, the features are returned for each conformer separately, by default False.

temperature : Union[int, float], optional

The temperature in Kelvin at which the Boltzmann-weighted values are calculated, by default 298.15.

ignore_invalid : bool, optional

If True, the presence of any invalid conformer in the molecule vault is ignored during feature reduction. If False, the presence of any invalid conformer leads to the unreduced features being returned. Note that in both cases, invalid conformers are ignored when calculating the mean, min, and max feature values. By default True.

Returns:
Union[pd.DataFrame, Dict[int, Dict[str, Any]], List[Chem.rdchem.Mol], Chem.rdchem.Mol]

The bond features in the desired output format.
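When “dict” is requested, the result is nested as {bond index: {feature name: value}}. For per-feature processing it can be convenient to invert that nesting. The helper by_feature below is a hypothetical sketch, and the feature names in the example are made up.

```python
def by_feature(features):
    """Turn {index: {feature: value}} into {feature: {index: value}}."""
    result = {}
    for index, feature_values in features.items():
        for name, value in feature_values.items():
            result.setdefault(name, {})[index] = value
    return result


# Hypothetical output of return_bond_features(output_format="dict").
bond_features = {
    0: {"feature_a": 1.0, "feature_b": 1.09},
    1: {"feature_a": 2.0, "feature_b": 1.21},
}

print(by_feature(bond_features)["feature_a"])  # prints {0: 1.0, 1: 2.0}
```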

set_charge(charge)

Set the charge of the molecule.

Parameters:
charge : int

The total charge of the molecule that is used for feature calculation.

Returns:
None
set_multiplicity(multiplicity)

Set the multiplicity of the molecule.

Parameters:
multiplicity : int

The spin multiplicity of the molecule that is used for feature calculation.

Returns:
None
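Charge and spin multiplicity have to fit together: the multiplicity equals the number of unpaired electrons plus one, so the remaining electrons must pair up. The sketch below checks this before the values are passed to set_charge() and set_multiplicity(). Both helper functions are hypothetical, and whether BONAFIDE performs a similar check is not stated here.

```python
def electron_count(atomic_numbers, charge):
    """Total number of electrons of a molecule with the given total charge."""
    return sum(atomic_numbers) - charge


def multiplicity_is_possible(n_electrons, multiplicity):
    """Check that a spin multiplicity is compatible with an electron count."""
    unpaired = multiplicity - 1
    if multiplicity < 1 or unpaired > n_electrons:
        return False
    # All electrons that are not unpaired must form pairs.
    return (n_electrons - unpaired) % 2 == 0


water = [8, 1, 1]  # atomic numbers of O, H, H
print(multiplicity_is_possible(electron_count(water, 0), 1))  # prints True
print(multiplicity_is_possible(electron_count(water, 0), 2))  # prints False
print(multiplicity_is_possible(electron_count(water, 1), 2))  # prints True
```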
set_options(configs)

Change configuration settings for the individual programs used for feature calculation.

The input to this method must be a 2-tuple (or a list thereof), where the first entry is the dot-separated path to the configuration setting that should be changed and the second entry is the new value.

For listing all available configuration settings and their current values, see the print_options() method.

Parameters:
configs : Union[Tuple[str, Any], List[Tuple[str, Any]]]

A 2-tuple or a list of 2-tuples containing the configuration paths and their new values, e.g., (“bonafide.autocorrelation.depth”, 3).

Returns:
None
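How a dot-separated path addresses a setting in a nested configuration can be shown with a plain dictionary. The sketch below is an illustration only: the function set_option is hypothetical, the configuration excerpt and its starting value are made up, and the way BONAFIDE stores and validates its settings may differ.

```python
def set_option(config, path, value):
    """Set the value at a dot-separated path in a nested dictionary."""
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node[key]  # raises KeyError if the path does not exist
    if leaf not in node:
        raise KeyError(path)
    node[leaf] = value


# Hypothetical excerpt of a configuration; the starting value 2 is made up.
config = {"bonafide": {"autocorrelation": {"depth": 2}}}

configs = [("bonafide.autocorrelation.depth", 3)]
for path, value in configs:
    set_option(config, path, value)

print(config["bonafide"]["autocorrelation"]["depth"])  # prints 3
```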
show_molecule(index_type='atom', in_3D=False, image_size=(500, 500))

Display the molecule with atom indices, bond indices, or no indices.

Molecules can either be shown in an interactive 3D view (if 3D information is available) or in 2D as a Lewis structure.

Parameters:
index_type : str, optional

The type of indices to add to the structure. Can be “atom”, “bond”, or None, by default “atom”.

in_3D : bool, optional

If True, the molecule is shown in 3D (if 3D information is available), by default False.

image_size : Tuple[int, int], optional

The size of the displayed image in pixels (width, height), by default (500, 500).

Returns:
Union[PngImagePlugin.PngImageFile, ipywidgets.VBox]

A 2D or 3D depiction of the molecule, either as an image or an interactive 3D view.