Available descriptors
Avalon
- class optunaz.descriptors.Avalon(name, parameters)[source]
Avalon Descriptor
Avalon (see Gedeck P, et al. QSAR-how good is it in practice?) uses a fingerprint generator similar to that of Daylight fingerprints, but enumerates custom feature classes of the molecular graph (see the reference paper for the 16 feature classes used). Hash codes for the path-style features are computed implicitly during enumeration. Avalon generated the largest number of good models in the reference study, likely because the fingerprint generator was tuned toward the features contained in the data set.
ECFP
- class optunaz.descriptors.ECFP(name, parameters)[source]
Binary Extended Connectivity Fingerprint (ECFP).
ECFP (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints], are built by applying the Morgan algorithm to a set of user-supplied atom invariants. This approach (implemented here using GetMorganFingerprintAsBitVect from RDKit) systematically records the neighborhood of each non-H atom into multiple circular layers up to a given radius (provided at runtime). The substructural features are mapped to integers using a hashing procedure (length of the hash provided at runtime). It is the set of the resulting identifiers that defines ECFPs. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
- class Parameters(radius=3, nBits=2048, returnRdkit=False)[source]
- Parameters:
radius (int) – Radius of the atom environments considered. Note that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while here we use radius. For example, radius=2 would correspond to ECFP4. - minimum: 1, title: radius
nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits
returnRdkit (bool) –
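The radius/diameter naming convention described above can be sketched as a small helper. The function name `ecfp_name` is hypothetical and is not part of optunaz or RDKit; it only illustrates that the number in the fingerprint name is twice the radius.

```python
def ecfp_name(radius: int) -> str:
    """Return the conventional ECFP name for a given radius.

    Hypothetical helper for illustration; not part of optunaz or RDKit.
    """
    if radius < 1:
        raise ValueError("radius must be >= 1")
    # The name carries the diameter, which is twice the radius.
    return f"ECFP{2 * radius}"


print(ecfp_name(2))  # ECFP4
print(ecfp_name(3))  # ECFP6 (the name matching the default radius=3)
```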
ECFP_counts
- class optunaz.descriptors.ECFP_counts(name, parameters)[source]
ECFP With Counts
Binary Extended Connectivity Fingerprint (ECFP) With Counts.
ECFP With Counts (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints] are built similarly to ECFP fingerprints; however, this approach (implemented using GetHashedMorganFingerprint from RDKit) records count vectors rather than bit vectors. Bit vectors track whether features appear in a molecule, while count vectors track the number of times each feature appears. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
- class Parameters(radius=3, useFeatures=True, nBits=2048)[source]
- Parameters:
radius (int) – Radius of the atom environments considered. For ECFP4 (diameter=4) set radius=2 - minimum: 1, title: radius
useFeatures (bool) – Use feature fingerprints (FCFP), instead of normal ones (ECFP). RDKit feature definitions are adapted from the definitions in Gobbi & Poppinger, Biotechnology and Bioengineering 61, 47-54 (1998). FCFP and ECFP will likely lead to different fingerprints/similarity scores. - title: useFeatures
nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits
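The difference between bit vectors and count vectors can be illustrated with a toy folding routine. This is a minimal sketch and not the RDKit implementation: `zlib.crc32` stands in for the real hashing procedure, the function name `fold` is hypothetical, and the feature strings are placeholders rather than real atom-environment identifiers.

```python
import zlib
from typing import Iterable, List


def fold(features: Iterable[str], n_bits: int, counts: bool) -> List[int]:
    """Fold feature identifiers into a fixed-length vector.

    Toy illustration only: zlib.crc32 is an assumed stand-in for the
    real hashing procedure.
    """
    vector = [0] * n_bits
    for feature in features:
        position = zlib.crc32(feature.encode("utf-8")) % n_bits
        if counts:
            vector[position] += 1  # count vector: number of occurrences
        else:
            vector[position] = 1   # bit vector: presence only
    return vector


# Placeholder identifiers; "env_a" occurs three times.
features = ["env_a", "env_b", "env_a", "env_c", "env_a"]

bits = fold(features, n_bits=2048, counts=False)
cnts = fold(features, n_bits=2048, counts=True)

print(sum(bits))  # number of positions set, at most 3
print(sum(cnts))  # total number of feature occurrences, 5
```

The bit vector cannot distinguish one occurrence of `env_a` from three, whereas the count vector keeps that information.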
PathFP
- class optunaz.descriptors.PathFP(name, parameters)[source]
Path fingerprint based on RDKit FP Generator.
This is a Path fingerprint.
MACCS_keys
- class optunaz.descriptors.MACCS_keys(name, parameters=MACCS_keys.Parameters())[source]
MACCS
Molecular Access System (MACCS) fingerprint.
MACCS fingerprints (often referred to as MDL keys, after the company that developed them) are 166-bit 2D structure fingerprints calculated using keysets originally constructed and optimized for substructure searching (see Durant et al. Reoptimization of MDL keys for use in drug discovery).
Essentially, they are binary fingerprints (zeros and ones) that answer 166 fragment-related questions. If the explicitly defined fragment exists in the structure, the bit in that position is set to 1; if not, it is set to 0. The position of each bit therefore matters, because it corresponds to a specific question or fragment. An atom can belong to multiple MACCS keys, and since each bit is binary, the 166 MACCS keys can represent more than 9.3×10^49 distinct fingerprint vectors.
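The figure quoted above follows from 166 independent binary positions, and the "one question per position" idea can be sketched with a toy key set. The three keys and the fragment labels below are placeholders, not real MACCS definitions (which are substructure patterns), and `keyed_fingerprint` is a hypothetical name.

```python
from typing import Callable, Dict, List, Set

# 166 binary positions allow 2**166 distinct bit patterns.
n_keys = 166
n_vectors = 2 ** n_keys
print(f"{n_vectors:.2e}")  # 9.35e+49

# Placeholder key definitions: each position asks one yes/no question
# about the fragments found in a molecule.
KEYS: Dict[int, Callable[[Set[str]], bool]] = {
    0: lambda frags: "ring" in frags,
    1: lambda frags: "carbonyl" in frags,
    2: lambda frags: "halogen" in frags,
}


def keyed_fingerprint(fragments: Set[str]) -> List[int]:
    """Set bit i to 1 if the question at position i is answered 'yes'."""
    return [1 if KEYS[i](fragments) else 0 for i in sorted(KEYS)]


print(keyed_fingerprint({"ring", "halogen"}))  # [1, 0, 1]
```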
UnscaledPhyschemDescriptors
- class optunaz.descriptors.UnscaledPhyschemDescriptors(name='UnscaledPhyschemDescriptors', parameters=UnscaledPhyschemDescriptors.Parameters(rdkit_names=None))[source]
Base (unscaled) PhyschemDescriptors (RDKit) for PhyschemDescriptors
These physchem descriptors are unscaled and should be used with caution. They are a set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).
Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.
UnscaledJazzyDescriptors
- class optunaz.descriptors.UnscaledJazzyDescriptors(name='UnscaledJazzyDescriptors', parameters=UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None))[source]
Base (unscaled) Jazzy descriptors
These Jazzy descriptors are unscaled and should be used with caution. They offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: this descriptor employs a threshold of <50 hydrogen acceptors/donors and a Mw of <1000 Da for input compounds.
UnscaledZScalesDescriptors
PhyschemDescriptors
- class optunaz.descriptors.PhyschemDescriptors(parameters=PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>), name='PhyschemDescriptors')[source]
PhyschemDescriptors (scaled) calculated in RDKit
A set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).
Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.
- class Parameters(rdkit_names: Optional[List[str]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)[source]
- descriptor
alias of UnscaledPhyschemDescriptors
JazzyDescriptors
- class optunaz.descriptors.JazzyDescriptors(parameters=JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>), name='JazzyDescriptors')[source]
Scaled Jazzy descriptors
Jazzy descriptors offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: Jazzy employs a threshold of <50 hydrogen acceptors/donors and a Mw of <1000 Da for input compounds.
- class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)[source]
- descriptor
alias of UnscaledJazzyDescriptors
PrecomputedDescriptorFromFile
- class optunaz.descriptors.PrecomputedDescriptorFromFile(name, parameters)[source]
Precomputed descriptors.
Users can supply a CSV file of feature vectors to use as descriptors, with headers on the first line. Each row corresponds to a compound in the training set: one column holds the SMILES string, and another column holds the comma-separated vector describing that molecule.
- class Parameters(file=None, input_column=None, response_column=None)[source]
- Parameters:
file (str) – Name of the CSV containing precomputed descriptors - title: file
input_column (str) – Name of input column with SMILES strings - title: Input Column
response_column (str) – Name of response column with the comma-separated vectors that the model will use as pre-computed descriptors - title: Response column
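The expected file layout can be sketched with the standard library. This is an illustration under assumptions, not the optunaz reader: the column names `smiles` and `fp`, the values, and the helper `read_precomputed` are hypothetical. Because the vector itself contains commas, the cell has to be quoted in the CSV.

```python
import csv
import io
from typing import Dict, List

# Hypothetical file contents; column names and values are placeholders.
csv_text = '''smiles,fp
CCO,"0.1,0.5,1.0"
c1ccccc1,"0.0,0.2,0.7"
'''


def read_precomputed(handle, input_column: str, response_column: str) -> Dict[str, List[float]]:
    """Map each SMILES string to its precomputed descriptor vector."""
    table = {}
    for row in csv.DictReader(handle):
        # The vector is stored as one quoted, comma-separated cell.
        table[row[input_column]] = [float(x) for x in row[response_column].split(",")]
    return table


descriptors = read_precomputed(io.StringIO(csv_text), "smiles", "fp")
print(descriptors["CCO"])  # [0.1, 0.5, 1.0]
```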
ZScales
- class optunaz.descriptors.ZScalesDescriptors(parameters=ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>), name='ZScalesDescriptors')[source]
Scaled Z-Scales descriptors.
Z-scales were proposed in Sandberg et al. (1998) based on physicochemical properties of proteinogenic and non-proteinogenic amino acids, including NMR data and thin-layer chromatography (TLC) data. Refer to doi:10.1021/jm9700575 for the original publication. These descriptors capture 1. lipophilicity, 2. steric properties (steric bulk and polarizability), 3. electronic properties (polarity and charge), 4. electronegativity (heat of formation, electrophilicity and hardness) and 5. a further electronegativity-related scale. This fingerprint is the computed average of the Z-scales of all the amino acids in the peptide.
- class Parameters(scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)[source]
- descriptor
alias of UnscaledZScalesDescriptors
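The averaging step described for Z-scales can be sketched as follows. The per-residue numbers are invented placeholders and are NOT the published Z-scale values; `TOY_SCALES` and `average_scales` are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Sequence

# Placeholder five-component vectors per residue. These numbers are
# invented for the example and are NOT the published Z-scale values.
TOY_SCALES: Dict[str, Sequence[float]] = {
    "A": (1.0, 2.0, 3.0, 4.0, 5.0),
    "G": (3.0, 0.0, 1.0, 2.0, 1.0),
}


def average_scales(peptide: str, scales: Dict[str, Sequence[float]]) -> List[float]:
    """Average the per-residue vectors over all residues of a peptide."""
    if not peptide:
        raise ValueError("peptide must contain at least one residue")
    vectors = [scales[residue] for residue in peptide]
    # zip(*vectors) iterates over the five components column by column.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(average_scales("AG", TOY_SCALES))  # [2.0, 1.0, 2.0, 3.0, 3.0]
```

The result has a fixed length regardless of peptide length, which is what makes the average usable as a descriptor vector.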
SmilesFromFile
- class optunaz.descriptors.SmilesFromFile(name, parameters=SmilesFromFile.Parameters())[source]
SMILES as descriptors (for ChemProp).
ChemProp optimisation runs require either this descriptor or the SmilesAndSideInfoFromFile descriptor to be selected. This setting allows the SMILES to pass through to the ChemProp package.
SmilesAndSideInfoFromFile
- class optunaz.descriptors.SmilesAndSideInfoFromFile(name, parameters)[source]
SMILES & side information descriptors (for ChemProp).
ChemProp optimisation requires either these or SmilesFromFile descriptors. This descriptor allows SMILES to pass through to ChemProp, _and_ for side information to be supplied as auxiliary tasks.
Side information can take the form of any vector (continuous or binary) that describes the input compounds. All tasks are learnt in a multi-task manner to improve main-task (task of intent) predictions. Side information can boost performance since its contribution to the network loss can lead to improved learnt molecular representations.
Optimal side information weighting (how much auxiliary tasks contribute to network loss) is also an (optional) learned parameter during optimisation.
Similar to PrecomputedDescriptorFromFile, CSV inputs for this descriptor should contain a SMILES column of input molecules. All vectors in the remaining columns are used as user-derived side information, so take care to upload a CSV whose remaining columns contain only side-information tasks, since _all_ of them are used
(see https://ruder.io/multi-task/index.html#auxiliarytasks for details).
- class Parameters(file=None, input_column=None, aux_weight_pc=SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20))[source]
- Parameters:
file (str) – Name of the CSV containing precomputed side-info descriptors - title: file
input_column (str) – Name of input column with SMILES strings - title: Input Column
aux_weight_pc (Aux_Weight_Pc) – How much (%) auxiliary tasks (side information) contribute to the loss function optimised during training. The larger the number, the larger the weight of side information. - title: Auxiliary weight percentage
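The CSV convention for side information (one SMILES column, every other column treated as an auxiliary task) can be sketched as below. This is a minimal illustration, not the optunaz loader: the column names, the values, and the helper `split_side_info` are hypothetical.

```python
import csv
import io

# Hypothetical file: one SMILES column plus two side-information columns.
csv_text = """smiles,aux_task_1,aux_task_2
CCO,0.3,1
c1ccccc1,0.8,0
"""


def split_side_info(handle, input_column):
    """Separate the SMILES column from all remaining (side-info) columns."""
    reader = csv.DictReader(handle)
    # Every column other than the input column is treated as side information.
    aux_columns = [c for c in reader.fieldnames if c != input_column]
    smiles, side_info = [], []
    for row in reader:
        smiles.append(row[input_column])
        side_info.append([float(row[c]) for c in aux_columns])
    return smiles, aux_columns, side_info


smiles, aux_columns, side_info = split_side_info(io.StringIO(csv_text), "smiles")
print(aux_columns)   # ['aux_task_1', 'aux_task_2']
print(side_info[0])  # [0.3, 1.0]
```

Any stray column left in the file (an ID or a comment, for example) would be picked up as an auxiliary task in this scheme, which is why the file should contain only the SMILES column and the intended side-information columns.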
ScaledDescriptor
- class optunaz.descriptors.ScaledDescriptor(parameters, name='ScaledDescriptor')[source]
Scaled Descriptor.
This descriptor is not a complete descriptor, but instead it wraps and scales another descriptor.
Some algorithms require input to be within certain range, e.g. [-1..1]. Some descriptors have different ranges for different columns/features. This descriptor wraps another descriptor and provides scaled values.
- class ScaledDescriptorParameters(descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors], scaler: Union[FittedSklearnScaler, UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'))[source]
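What "wrapping and scaling" means for columns with different ranges can be sketched with a hand-rolled min-max rescaling to [-1, 1]. This is only an illustration: the class signature refers to scikit-learn scalers, and which scaling method is actually applied is not stated here, so min-max scaling and the function name `scale_columns` are assumptions.

```python
from typing import List


def scale_columns(rows: List[List[float]], low: float = -1.0, high: float = 1.0) -> List[List[float]]:
    """Rescale every column of a descriptor matrix to [low, high]."""
    scaled_columns = []
    for column in zip(*rows):
        c_min, c_max = min(column), max(column)
        span = c_max - c_min
        if span == 0:
            # Constant column: map to the middle of the target range.
            scaled_columns.append([(low + high) / 2.0] * len(column))
        else:
            scaled_columns.append(
                [low + (value - c_min) * (high - low) / span for value in column]
            )
    # Transpose back to one row per molecule.
    return [list(row) for row in zip(*scaled_columns)]


# Two features on very different scales (placeholder values).
raw = [[100.0, 0.25], [300.0, 0.75], [500.0, 0.5]]
print(scale_columns(raw))  # [[-1.0, -1.0], [0.0, 1.0], [1.0, 0.0]]
```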
CompositeDescriptor
- class optunaz.descriptors.CompositeDescriptor(parameters, name='CompositeDescriptor')[source]
Composite descriptor
Concatenates multiple descriptors into one. Select multiple algorithms from the button below. Please note the ChemProp SMILES descriptors are not compatible with this function.
- class Parameters(descriptors: List[Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors]])[source]
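Concatenation of several descriptors into one vector can be sketched as follows. The two toy descriptors operate on the raw SMILES string purely for illustration and are not real chemical descriptors; `composite`, `length_descriptor` and `carbon_counts` are hypothetical names, not optunaz functions.

```python
from typing import Callable, List, Sequence

Descriptor = Callable[[str], List[float]]


def composite(descriptors: Sequence[Descriptor]) -> Descriptor:
    """Build a descriptor that concatenates the outputs of several others."""
    def calculate(smiles: str) -> List[float]:
        combined: List[float] = []
        for descriptor in descriptors:
            combined.extend(descriptor(smiles))
        return combined
    return calculate


# Placeholder descriptors working on the SMILES text itself.
def length_descriptor(smiles: str) -> List[float]:
    return [float(len(smiles))]


def carbon_counts(smiles: str) -> List[float]:
    return [float(smiles.count("C")), float(smiles.count("c"))]


combined = composite([length_descriptor, carbon_counts])
print(combined("CCO"))  # [3.0, 2.0, 0.0]
```

The length of the combined vector is the sum of the lengths of the individual descriptor vectors, and the order of the components follows the order in which the descriptors are listed.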
AmorProtDescriptors
UnscaledMAPC
- class optunaz.descriptors.UnscaledMAPC(name, parameters)[source]
Unscaled MAPC descriptors
These MAPC descriptors are unscaled and should be used with caution. MinHashed Atom-Pair Fingerprint Chiral (MAPC; see Orsi et al. One chiral fingerprint to find them all) builds on the original MinHashed Atom-Pair fingerprint of radius 2 (MAP4), which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.
These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.
MAPC
- class optunaz.descriptors.MAPC(parameters=MAPC.Parameters(maxRadius=2, nPermutations=2048, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledMAPC'>), name='MAPC')[source]
Scaled MAPC descriptors
MAPC (MinHashed Atom-Pair Fingerprint Chiral) (see Orsi et al. One chiral fingerprint to find them all) builds on the original MinHashed Atom-Pair fingerprint of radius 2 (MAP4), which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.
These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.
- class Parameters(maxRadius: int = 2, nPermutations: int = 2048, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledMAPC'>)[source]
- descriptor
alias of UnscaledMAPC
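The MinHash idea underlying MAP-style fingerprints can be sketched generically. This is not the MAPC algorithm: the feature strings are placeholders, each permutation is emulated by salting a `hashlib.blake2b` hash with its index, and the names `minhash` and `estimated_jaccard` are hypothetical. The `n_permutations` argument plays the role that nPermutations plays in the signature above.

```python
import hashlib
from typing import Iterable, List, Sequence


def minhash(features: Iterable[str], n_permutations: int) -> List[int]:
    """Generic MinHash signature of a set of string features."""
    feature_set = set(features)
    if not feature_set:
        raise ValueError("at least one feature is required")
    signature = []
    for seed in range(n_permutations):
        salt = seed.to_bytes(4, "big")
        # Keep the smallest salted hash value over all features.
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(salt + f.encode("utf-8"), digest_size=8).digest(),
                    "big",
                )
                for f in feature_set
            )
        )
    return signature


def estimated_jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching positions, an estimate of set similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Placeholder feature strings standing in for encoded atom pairs.
sig_a = minhash(["f1", "f2", "f3", "f4"], n_permutations=256)
sig_b = minhash(["f1", "f2", "f3", "f5"], n_permutations=256)

print(len(sig_a))                       # 256
print(estimated_jaccard(sig_a, sig_a))  # 1.0
print(estimated_jaccard(sig_a, sig_b))  # a value between 0 and 1
```

More permutations give a longer signature and a less noisy similarity estimate, at the cost of more hashing work per molecule.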