Available descriptors

Avalon

class optunaz.descriptors.Avalon(name, parameters)[source]

Avalon Descriptor

Avalon (see Gedeck P, et al. QSAR-how good is it in practice?) uses a fingerprint generator similar to the one used for Daylight fingerprints, but enumerates custom feature classes of the molecular graph (see the reference paper for the 16 feature classes used). Hash codes for the path-style features are computed implicitly during enumeration. Avalon generated the largest number of good models in the reference study, likely because the fingerprint generator was tuned toward the features contained in the data set.

ECFP

class optunaz.descriptors.ECFP(name, parameters)[source]

Binary Extended Connectivity Fingerprint (ECFP).

ECFPs (see Rogers et al. “Extended-Connectivity Fingerprints.”), also known as Circular Fingerprints or Morgan Fingerprints, are built by applying the Morgan algorithm to a set of user-supplied atom invariants. This approach (implemented here using GetMorganFingerprintAsBitVect from RDKit) systematically records the neighborhood of each non-H atom in multiple circular layers up to a given radius (provided at runtime). The substructural features are mapped to integers using a hashing procedure (the length of the hash is provided at runtime). The set of resulting identifiers defines the ECFP. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
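As a rough illustration of the radius/diameter naming convention and of folding hashed identifiers into a fixed-length bit vector, the sketch below uses only the Python standard library. The helper names (`ecfp_name`, `fold_to_bits`) and the example identifiers are hypothetical, and the modulo folding is a simplifying assumption rather than RDKit's actual hashing scheme.

```python
def ecfp_name(radius: int) -> str:
    """Return the conventional ECFP name for a radius (diameter = 2 * radius)."""
    if radius < 1:
        raise ValueError("radius must be >= 1")
    return f"ECFP{2 * radius}"


def fold_to_bits(identifiers, n_bits: int = 2048):
    """Fold integer feature identifiers into a binary vector of length n_bits.

    Assumption: plain modulo folding, used here only for illustration.
    """
    bits = [0] * n_bits
    for identifier in identifiers:
        bits[identifier % n_bits] = 1  # presence only; repeats are not counted
    return bits


# Hypothetical identifiers standing in for hashed atom environments.
example_ids = [98513984, 2246728737, 98513984, 864662311]
fingerprint = fold_to_bits(example_ids, n_bits=16)
print(ecfp_name(2), fingerprint)
```

A small `n_bits` makes collisions (two identifiers sharing one position) more likely, which is one reason larger sizes such as 2048 are common.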

class Parameters(radius=3, nBits=2048, returnRdkit=False)[source]
Parameters:
  • radius (int) – Radius of the atom environments considered. Note that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while here we use radius. For example, radius=2 would correspond to ECFP4. - minimum: 1, title: radius

  • nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits

  • returnRdkit (bool) –

calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

ECFP_counts

class optunaz.descriptors.ECFP_counts(name, parameters)[source]

ECFP With Counts

Binary Extended Connectivity Fingerprint (ECFP) With Counts.

ECFPs (see Rogers et al. “Extended-Connectivity Fingerprints.”), also known as Circular Fingerprints or Morgan Fingerprints, With Counts are built similarly to ECFP fingerprints; however, this approach (implemented using GetHashedMorganFingerprint from RDKit) records count vectors rather than bit vectors. Bit vectors track whether features appear in a molecule, while count vectors track the number of times each feature appears. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
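The difference between bit vectors and count vectors can be sketched with the standard library alone. The function name `bit_and_count_vectors` and the integer identifiers are hypothetical stand-ins for hashed features, and modulo folding is again an assumption made for illustration.

```python
from collections import Counter


def bit_and_count_vectors(identifiers, n_bits: int = 2048):
    """Return (bits, counts) for a list of integer feature identifiers."""
    counts = [0] * n_bits
    # Count how often each folded position is hit.
    for position, n in Counter(i % n_bits for i in identifiers).items():
        counts[position] = n
    # The bit vector only records whether a position was hit at all.
    bits = [1 if c > 0 else 0 for c in counts]
    return bits, counts


# Feature 3 occurs three times, feature 10 (folded to position 2) once.
bits, counts = bit_and_count_vectors([3, 3, 3, 10], n_bits=8)
print(bits)    # [0, 0, 1, 1, 0, 0, 0, 0]
print(counts)  # [0, 0, 1, 3, 0, 0, 0, 0]
```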

class Parameters(radius=3, useFeatures=True, nBits=2048)[source]
Parameters:
  • radius (int) – Radius of the atom environments considered. For ECFP4 (diameter=4) set radius=2 - minimum: 1, title: radius

  • useFeatures (bool) – Use feature fingerprints (FCFP), instead of normal ones (ECFP). RDKit feature definitions are adapted from the definitions in Gobbi & Poppinger, Biotechnology and Bioengineering 61, 47-54 (1998). FCFP and ECFP will likely lead to different fingerprints/similarity scores. - title: useFeatures

  • nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits

calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

PathFP

class optunaz.descriptors.PathFP(name, parameters)[source]

Path fingerprint based on RDKit FP Generator.

This is a Path fingerprint.

class Parameters(maxPath=3, fpSize=2048)[source]
Parameters:
  • maxPath (int) – Maximum path for the fingerprint - minimum: 1, title: maxPath

  • fpSize (int) – Size of the fingerprint, sometimes also called bit size. - minimum: 1, title: fpSize

calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

MACCS_keys

class optunaz.descriptors.MACCS_keys(name, parameters=MACCS_keys.Parameters())[source]

MACCS

Molecular Access System (MACCS) fingerprint.

MACCS fingerprints (often referred to as MDL keys, after the developing company) are 166-bit 2D structure fingerprints calculated using keysets originally constructed and optimized for substructure searching (see Durant et al. Reoptimization of MDL keys for use in drug discovery).

Essentially, they are a binary fingerprint (zeros and ones) that answers 166 fragment-related questions. If the explicitly defined fragment exists in the structure, the bit in that position is set to 1, and if not, it is set to 0. In that sense, the position of the bit matters because it is tied to a specific question or fragment. An atom can belong to multiple MACCS keys, and since each bit is binary, MACCS 166 keys can represent more than 9.3×10^49 distinct fingerprint vectors.
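The count of distinct vectors follows from 166 independent binary positions, i.e. 2^166. The sketch below checks that figure and shows a toy key-based fingerprint; the four key names are hypothetical placeholders and are not actual MACCS key definitions.

```python
# Number of distinct binary vectors with 166 positions.
n_keys = 166
n_vectors = 2 ** n_keys
print(f"{n_vectors:.2e}")  # 9.35e+49


def key_fingerprint(fragments_present, key_definitions):
    """Set position i to 1 if the i-th predefined fragment is present."""
    return [1 if key in fragments_present else 0 for key in key_definitions]


# Hypothetical key set; position in the list fixes the meaning of each bit.
toy_keys = ["ring", "carbonyl", "halogen", "amine"]
toy_fp = key_fingerprint({"ring", "amine"}, toy_keys)
print(toy_fp)  # [1, 0, 0, 1]
```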

class Parameters[source]
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

UnscaledPhyschemDescriptors

class optunaz.descriptors.UnscaledPhyschemDescriptors(name='UnscaledPhyschemDescriptors', parameters=UnscaledPhyschemDescriptors.Parameters(rdkit_names=None))[source]

Base (unscaled) PhyschemDescriptors (RDKit) for PhyschemDescriptors

These physchem descriptors are unscaled and should be used with caution. They are a set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).

Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.

class Parameters(rdkit_names: Optional[List[str]] = None)[source]
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

UnscaledJazzyDescriptors

class optunaz.descriptors.UnscaledJazzyDescriptors(name='UnscaledJazzyDescriptors', parameters=UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None))[source]

Base (unscaled) Jazzy descriptors

These Jazzy descriptors are unscaled and should be used with caution. They offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: this descriptor employs a threshold of <50 Hydrogen acceptors/donors and a Mw of <1000Da for compound inputs.

class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None)[source]
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

UnscaledZScalesDescriptors

class optunaz.descriptors.UnscaledZScalesDescriptors(name='UnscaledZScalesDescriptors', parameters=UnscaledZScalesDescriptors.Parameters())[source]

Unscaled Z-Scales.

Compute the Z-scales of a peptide SMILES. These Z-Scales descriptors are unscaled and should be used with caution.

class Parameters[source]
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

PhyschemDescriptors

class optunaz.descriptors.PhyschemDescriptors(parameters=PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>), name='PhyschemDescriptors')[source]

PhyschemDescriptors (scaled) calculated in RDKit

A set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).

Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.

class Parameters(rdkit_names: Optional[List[str]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)[source]
descriptor

alias of UnscaledPhyschemDescriptors

JazzyDescriptors

class optunaz.descriptors.JazzyDescriptors(parameters=JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>), name='JazzyDescriptors')[source]

Scaled Jazzy descriptors

Jazzy descriptors offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: Jazzy employs a threshold of <50 Hydrogen acceptors/donors and Mw of <1000Da for input compounds.

class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)[source]
descriptor

alias of UnscaledJazzyDescriptors

PrecomputedDescriptorFromFile

class optunaz.descriptors.PrecomputedDescriptorFromFile(name, parameters)[source]

Precomputed descriptors.

Users can supply a CSV file of feature vectors to use as descriptors, with headers on the first line. Each row corresponds to a compound in the training set and includes a column that may hold a comma-separated vector describing that molecule.

class Parameters(file=None, input_column=None, response_column=None)[source]
Parameters:
  • file (str) – Name of the CSV containing precomputed descriptors - title: file

  • input_column (str) – Name of input column with SMILES strings - title: Input Column

  • response_column (str) – Name of response column with the comma-separated vectors that the model will use as pre-computed descriptors - title: Response column
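To make the expected file layout more concrete, the following sketch parses a small in-memory CSV with the standard library. The column names (`smiles`, `activity`, `fp`), the values, and the helper `load_precomputed` are all hypothetical; quoting the vector so its commas survive CSV parsing is an assumption about the file format, not something this page confirms.

```python
import csv
import io

# Hypothetical file content: one SMILES column and one quoted vector column.
csv_text = """smiles,activity,fp
CCO,1.2,"0.1,0.5,0.9"
c1ccccc1,3.4,"0.3,0.2,0.7"
"""


def load_precomputed(handle, input_column, response_column):
    """Map each SMILES string to its precomputed descriptor vector."""
    table = {}
    for row in csv.DictReader(handle):
        vector = [float(x) for x in row[response_column].split(",")]
        table[row[input_column]] = vector
    return table


descriptors = load_precomputed(io.StringIO(csv_text), "smiles", "fp")
print(descriptors["CCO"])  # [0.1, 0.5, 0.9]
```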

calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

inference_parameters(file, input_column, response_column)[source]

This function allows precomputed descriptors to be used for inference for a new file

ZScales

class optunaz.descriptors.ZScalesDescriptors(parameters=ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>), name='ZScalesDescriptors')[source]

Scaled Z-Scales descriptors.

Z-scales were proposed in Sandberg et al (1998) based on physicochemical properties of proteinogenic and non-proteinogenic amino acids, including NMR data and thin-layer chromatography (TLC) data. Refer to doi:10.1021/jm9700575 for the original publication. These descriptors capture 1. lipophilicity, 2. steric properties (steric bulk and polarizability), 3. electronic properties (polarity and charge), and 4. and 5. electronegativity-related properties (heat of formation, electrophilicity and hardness). This fingerprint is the computed average of the Z-scales of all the amino acids in the peptide.
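The averaging step can be sketched as follows. The per-residue values are placeholders chosen for easy arithmetic and are not the published Z-scale values; the sketch also assumes the residue sequence is already known, whereas the descriptor itself takes a peptide SMILES.

```python
# Placeholder five-component vectors per residue (NOT real Z-scale values).
PLACEHOLDER_SCALES = {
    "A": (1.0, 2.0, 3.0, 4.0, 5.0),
    "G": (3.0, 0.0, 1.0, 2.0, 1.0),
}


def average_scales(sequence, scales):
    """Average the per-residue vectors component-wise over the sequence."""
    vectors = [scales[residue] for residue in sequence]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


averaged = average_scales("AG", PLACEHOLDER_SCALES)
print(averaged)  # [2.0, 1.0, 2.0, 3.0, 3.0]
```

Because the result is an average, the vector length stays at five components regardless of peptide length.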

class Parameters(scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)[source]
descriptor

alias of UnscaledZScalesDescriptors

SmilesFromFile

class optunaz.descriptors.SmilesFromFile(name, parameters=SmilesFromFile.Parameters())[source]

Smiles as descriptors (for ChemProp).

ChemProp optimisation runs require either this or SmilesAndSideInfoFromFile descriptor to be selected. This setting allows the SMILES to pass through to the ChemProp package.

class Parameters[source]
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

SmilesAndSideInfoFromFile

class optunaz.descriptors.SmilesAndSideInfoFromFile(name, parameters)[source]

SMILES & side information descriptors (for ChemProp).

ChemProp optimisation requires either these or SmilesFromFile descriptors. This descriptor allows SMILES to pass through to ChemProp, _and_ for side information to be supplied as auxiliary tasks.

Side information can take the form of any vector (continuous or binary) that describes the input compounds. All tasks are learnt in a multi-task manner to improve main-task (task of intent) predictions. Side information can boost performance since its contribution to network loss can lead to improved learnt molecular representations.

Optimal side information weighting (how much auxiliary tasks contribute to network loss) is also an (optional) learned parameter during optimisation.

Similar to PrecomputedDescriptorFromFile, CSV inputs for this descriptor should contain a SMILES column of input molecules. All vectors in the remaining columns are used as user-derived side information (i.e. take care to upload a CSV that contains only side-information tasks in the other columns, since _all_ of them are used)

(see https://ruder.io/multi-task/index.html#auxiliarytasks for details).
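The "all remaining columns are used" behaviour can be illustrated with a small standard-library sketch. The column names, values, and the helper `split_smiles_and_side_info` are hypothetical and only mimic the described layout; they are not the package's own loading code.

```python
import csv
import io

# Hypothetical file: one SMILES column, every other column is side information.
csv_text = """smiles,aux_task_1,aux_task_2
CCO,0.7,1
c1ccccc1,0.1,0
"""


def split_smiles_and_side_info(handle, input_column):
    """Separate the SMILES column from all remaining (side-information) columns."""
    smiles, side_info = [], []
    for row in csv.DictReader(handle):
        smiles.append(row.pop(input_column))
        # Everything left in the row is treated as an auxiliary task value.
        side_info.append([float(value) for value in row.values()])
    return smiles, side_info


smiles, side_info = split_smiles_and_side_info(io.StringIO(csv_text), "smiles")
print(smiles)     # ['CCO', 'c1ccccc1']
print(side_info)  # [[0.7, 1.0], [0.1, 0.0]]
```

Any stray column, such as an identifier or the main response, would be picked up as an auxiliary task in this scheme, which is why the file should contain nothing else.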

class Parameters(file=None, input_column=None, aux_weight_pc=SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20))[source]
Parameters:
  • file (str) – Name of the CSV containing precomputed side-info descriptors - title: file

  • input_column (str) – Name of input column with SMILES strings - title: Input Column

  • aux_weight_pc (Aux_Weight_Pc) – How much (%) auxiliary tasks (side information) contribute to the loss function optimised during training. The larger the number, the larger the weight of side information. - title: Auxiliary weight percentage

class Aux_Weight_Pc(low: int = 100, high: int = 100, q: int = 20)[source]
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

ScaledDescriptor

class optunaz.descriptors.ScaledDescriptor(parameters, name='ScaledDescriptor')[source]

Scaled Descriptor.

This descriptor is not a complete descriptor, but instead it wraps and scales another descriptor.

Some algorithms require input to be within certain range, e.g. [-1..1]. Some descriptors have different ranges for different columns/features. This descriptor wraps another descriptor and provides scaled values.
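As a minimal sketch of why scaling helps, the example below rescales each column of a small descriptor matrix to [-1, 1]. Min-max scaling and the helper name `min_max_scale_columns` are assumptions for illustration; the actual transformation depends on the scaler passed in the parameters.

```python
def min_max_scale_columns(rows, low=-1.0, high=1.0):
    """Rescale every column of a row-major matrix to the range [low, high]."""
    scaled_columns = []
    for column in zip(*rows):
        c_min, c_max = min(column), max(column)
        span = c_max - c_min
        if span == 0:
            # Constant column: map every value to the lower bound.
            scaled_columns.append([low] * len(column))
        else:
            scaled_columns.append(
                [low + (high - low) * (v - c_min) / span for v in column]
            )
    # Transpose back to row-major order.
    return [list(row) for row in zip(*scaled_columns)]


# Two features with very different ranges (hypothetical values).
raw = [[100.0, 0.5], [300.0, 1.5], [500.0, 2.5]]
scaled = min_max_scale_columns(raw)
print(scaled)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```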

class ScaledDescriptorParameters(descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors], scaler: Union[FittedSklearnScaler, UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'))[source]
calculate_from_smi(smi, cache=None)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

CompositeDescriptor

class optunaz.descriptors.CompositeDescriptor(parameters, name='CompositeDescriptor')[source]

Composite descriptor

Concatenates multiple descriptors into one. Select multiple algorithms from the button below. Please note the ChemProp SMILES descriptors are not compatible with this function.
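Concatenation can be pictured as appending the output vectors end to end. In the sketch below, plain lists stand in for NumPy arrays, and the two toy descriptor functions are hypothetical placeholders rather than descriptors from this module.

```python
def concatenate_descriptors(smiles, descriptor_functions):
    """Compute each descriptor for one SMILES and join the results in order."""
    combined = []
    for calculate in descriptor_functions:
        combined.extend(calculate(smiles))
    return combined


def toy_length(smiles):
    """Hypothetical one-value descriptor: length of the SMILES string."""
    return [float(len(smiles))]


def toy_atom_counts(smiles):
    """Hypothetical two-value descriptor: counts of 'C' and 'O' characters."""
    return [float(smiles.count("C")), float(smiles.count("O"))]


combined = concatenate_descriptors("CCO", [toy_length, toy_atom_counts])
print(combined)  # [3.0, 2.0, 1.0]
```

The length of the combined vector is the sum of the lengths of its parts, so the order of the descriptors fixes the meaning of each position.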

class Parameters(descriptors: List[Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors]])[source]
calculate_from_smi(smi, cache=None)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.