optunaz package
Subpackages
- optunaz.config package
- optunaz.utils package
- Subpackages
- optunaz.utils.enums package
- Submodules
- optunaz.utils.enums.building_configuration_enum module
- optunaz.utils.enums.configuration_enum module
- optunaz.utils.enums.interface_enum module
- optunaz.utils.enums.model_runner_enum module
- optunaz.utils.enums.objective_enum module
- optunaz.utils.enums.optimization_configuration_enum module
- optunaz.utils.enums.prediction_configuration_enum module
- optunaz.utils.enums.return_values_enum module
- optunaz.utils.enums.visualization_enum module
- Module contents
- optunaz.utils.preprocessing package
- Submodules
- optunaz.utils.files_paths module
- optunaz.utils.load_json module
- optunaz.utils.mlflow module
- optunaz.utils.retraining module
- optunaz.utils.schema module
- optunaz.utils.tracking module
- Module contents
Submodules
optunaz.automl module
- class optunaz.automl.ModelAutoML(output_path=None, input_data=None, n_cores=-1, email=None, user_name=None, smiles_col=None, activity_col=None, task_col=None, dry_run=False, timestr='20241003-121308')[source]
Bases:
object
Prepares the data for model training with ModelDispatcher. ModelAutoML also stores activity data for new tasks until enough data is available.
- property first_run
- property processed_timepoints
- property last_timepoint
- getAllRetrainingData()[source]
Returns a dict of the wildcard data, with converted datetimes as the keys.
- class optunaz.automl.ModelDispatcher(quorum=None, cfg=None, last_timepoint=None, initial_template=None, retrain_template=None, slurm_template=None, slurm_req_cores=1, slurm_req_partition=None, slurm_req_mem=None, slurm_al_pool=None, slurm_al_smiles=None, slurm_job_prefix=None, slurm_partition=None, save_previous_models=None, log_conf=None)[source]
Bases:
object
Uses the ModelAutoML config as a basis to prepare QSARtuna jobs and dispatch them to SLURM. ModelDispatcher always needs a quorum to prepare the model.
- property pretrained_model
Load a pretrained model
- checkIfRetrainingProcessed(taskcode)[source]
Checks if this timepoint has already been predicted (and therefore processed). Timepoints to be skipped with data but no model quorum will also be in .skipped dirs.
- checkisLocked(taskcode)[source]
Checks if this timepoint is locked for a given taskcode. Locks occur when QSARtuna is unable to run multiple retrain script instances.
- processTrain(_taskcode_df)[source]
Opens existing training data if possible, then formats the data and sets attributes for the previous data. If there is no retraining, creates the directory and returns the new SMILES and y for training.
- doTemporalPredictions(new_data)[source]
Start/check temporal (pseudo-prospective) predictions with an old QSARtuna model vs. newest data
optunaz.builder module
optunaz.convert module
optunaz.datareader module
- optunaz.datareader.read_data(filename, smiles_col='smiles', resp_col=None, response_type=None, aux_col=None, split_strategy=None)[source]
Reads data, drops NaNs and invalid SMILES. Supports SDF and CSV formats. For SDF, only the response column has to be provided, since the SMILES are parsed from the mol files inside.
Returns a tuple of ( SMILES (X), responses (Y), groups (groups) ).
- optunaz.datareader.deduplicate(smiles, y, aux, groups, deduplication_strategy, response_type)[source]
Removes duplicates based on RDKit canonical SMILES representation.
Returns a 2-tuple of original SMILES and deduplicated values.
In case there is an ambiguity which SMILES to return, as is the case for deduplication by averaging, returns canonical SMILES instead.
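The two return behaviours described above can be illustrated with a minimal sketch. It is not the optunaz implementation: `toy_canonicalise`, `deduplicate_keep_first` and `deduplicate_by_average` are hypothetical names, and the lookup table stands in for RDKit canonicalisation so that the example runs without RDKit.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for RDKit canonical SMILES: a tiny lookup table.
TOY_CANONICAL = {"OCC": "CCO", "C(C)O": "CCO"}


def toy_canonicalise(smiles):
    """Map a SMILES string to its (toy) canonical form."""
    return TOY_CANONICAL.get(smiles, smiles)


def deduplicate_keep_first(smiles, y):
    """Keep the first record per canonical key; original SMILES are returned."""
    seen = {}
    for smi, value in zip(smiles, y):
        key = toy_canonicalise(smi)
        if key not in seen:
            seen[key] = (smi, value)
    return [s for s, _ in seen.values()], [v for _, v in seen.values()]


def deduplicate_by_average(smiles, y):
    """Average duplicates; no single original SMILES applies, so canonical keys are returned."""
    groups = defaultdict(list)
    for smi, value in zip(smiles, y):
        groups[toy_canonicalise(smi)].append(value)
    keys = list(groups)
    return keys, [mean(groups[k]) for k in keys]


smiles = ["OCC", "CCO", "CCC"]
y = [1.0, 3.0, 5.0]
first_smiles, first_y = deduplicate_keep_first(smiles, y)
avg_smiles, avg_y = deduplicate_by_average(smiles, y)
```

With keep-first, the original string "OCC" survives. With averaging, the two ethanol records collapse to the canonical key "CCO".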
- class optunaz.datareader.Dataset(training_dataset_file, input_column, response_column, response_type=None, aux_column=None, aux_transform=None, deduplication_strategy=<factory>, split_strategy=<factory>, test_dataset_file=None, save_intermediate_files=False, intermediate_training_dataset_file=None, intermediate_test_dataset_file=None, log_transform=False, log_transform_base=None, log_transform_negative=None, log_transform_unit_conversion=None, probabilistic_threshold_representation=False, probabilistic_threshold_representation_threshold=None, probabilistic_threshold_representation_std=None)[source]
Bases:
object
Dataset.
Holds training data, optional test data, and names of input and response columns.
- training_dataset_file
- input_column
- response_column
- response_type = None
- aux_column = None
- aux_transform = None
- deduplication_strategy
- split_strategy
- test_dataset_file = None
- save_intermediate_files = False
- intermediate_training_dataset_file = None
- intermediate_test_dataset_file = None
- log_transform = False
- log_transform_base = None
- log_transform_negative = None
- log_transform_unit_conversion = None
- probabilistic_threshold_representation = False
- probabilistic_threshold_representation_threshold = None
- probabilistic_threshold_representation_std = None
optunaz.descriptors module
- exception optunaz.descriptors.ScalingFittingError(descriptor_str=None)[source]
Bases:
Exception
Raised when there are insufficient molecules for UnfittedSklearnScaler to fit.
- exception optunaz.descriptors.NoValidSmiles[source]
Bases:
Exception
Raised when no valid SMILES are available
- optunaz.descriptors.numpy_from_rdkit(fp, dtype)[source]
Returns Numpy representation of a given RDKit Fingerprint.
- class optunaz.descriptors.MolDescriptor[source]
Bases:
NameParameterDataclass, ABC
Molecular Descriptors.
Descriptors can be fingerprints, but can also be custom user-specified descriptors.
Descriptors calculate a feature vector that will be used as input for predictive models (e.g. scikit-learn models).
- class optunaz.descriptors.RdkitDescriptor[source]
Bases:
MolDescriptor, ABC
Abstract class for RDKit molecular descriptors (fingerprints).
- class optunaz.descriptors.AmorProtDescriptors(name, parameters)[source]
Bases:
MolDescriptor
These descriptors are intended to be used with Peptide SMILES
- class AmorProt(maccs=True, ecfp4=True, ecfp6=True, rdkit=True, W=10, A=10, R=0.85)
Bases:
object
- T(fp, p, W=10, A=10, R=0.85)
- fingerprint(seq)
- name
- parameters
- class optunaz.descriptors.Avalon(name, parameters)[source]
Bases:
RdkitDescriptor
Avalon Descriptor
Avalon (see Gedeck P, et al. QSAR-how good is it in practice?) uses a fingerprint generator in a similar way to Daylight fingerprints, but enumerates with custom feature classes of the molecular graph (see ref. paper for the 16 feature classes used). Hash codes for the path-style features are computed implicitly during enumeration. Avalon generated the largest number of good models in the reference study, likely because the fingerprint generator was tuned toward the features contained in the data set.
- class Parameters(nBits=2048)[source]
Bases:
object
- Parameters:
nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits
- nBits = 2048
- name
- parameters
- class optunaz.descriptors.ECFP(name, parameters)[source]
Bases:
RdkitDescriptor
Binary Extended Connectivity Fingerprint (ECFP).
ECFP (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints], are built by applying the Morgan algorithm to a set of user-supplied atom invariants. This approach (implemented here using GetMorganFingerprintAsBitVect from RDKit) systematically records the neighborhood of each non-H atom into multiple circular layers up to a given radius (provided at runtime). The substructural features are mapped to integers using a hashing procedure (length of the hash provided at runtime). It is the set of the resulting identifiers that defines ECFPs. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
- class Parameters(radius=3, nBits=2048, returnRdkit=False)[source]
Bases:
object
- Parameters:
radius (int) – Radius of the atom environments considered. Note that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while here we use radius. For example, radius=2 would correspond to ECFP4. - minimum: 1, title: radius
nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits
returnRdkit (bool) –
- radius = 3
- nBits = 2048
- returnRdkit = False
- name
- parameters
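The radius/diameter naming convention described above reduces to one line of arithmetic. The helper below is a hypothetical illustration (`ecfp_name` is not part of optunaz); it only encodes the rule that the number in the name is twice the radius.

```python
def ecfp_name(radius):
    """Return the conventional ECFP name for a given atom-environment radius."""
    if radius < 1:
        raise ValueError("radius must be >= 1")
    # The suffix in the name is the diameter, i.e. twice the radius.
    return f"ECFP{2 * radius}"


name_for_radius_2 = ecfp_name(2)  # "ECFP4"
name_for_default = ecfp_name(3)   # the default radius=3 corresponds to "ECFP6"
```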
- class optunaz.descriptors.ECFP_counts(name, parameters)[source]
Bases:
RdkitDescriptor
ECFP With Counts
Binary Extended Connectivity Fingerprint (ECFP) With Counts.
ECFP (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints] With Counts are built similarly to ECFP fingerprints; however, this approach (implemented using GetHashedMorganFingerprint from RDKit) records count vectors rather than bit vectors. Bit vectors track whether features appear in a molecule, while count vectors track the number of times each feature appears. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
- class Parameters(radius=3, useFeatures=True, nBits=2048)[source]
Bases:
object
- Parameters:
radius (int) – Radius of the atom environments considered. For ECFP4 (diameter=4) set radius=2 - minimum: 1, title: radius
useFeatures (bool) – Use feature fingerprints (FCFP), instead of normal ones (ECFP). RDKit feature definitions are adapted from the definitions in Gobbi & Poppinger, Biotechnology and Bioengineering 61, 47-54 (1998). FCFP and ECFP will likely lead to different fingerprints/similarity scores. - title: useFeatures
nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits
- radius = 3
- useFeatures = True
- nBits = 2048
- name
- parameters
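The difference between bit vectors and count vectors can be shown with a small sketch. It does not use RDKit: the feature strings and the SHA-256 based `fold` function are assumptions chosen only to mimic hashing features into a fixed-length vector of `n_bits` positions.

```python
import hashlib


def fold(feature, n_bits):
    """Hash a feature string to a position in [0, n_bits). Toy hashing, not RDKit's."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def bit_vector(features, n_bits):
    """Record only whether a feature (position) is present."""
    vec = [0] * n_bits
    for feature in features:
        vec[fold(feature, n_bits)] = 1
    return vec


def count_vector(features, n_bits):
    """Record how many times each feature (position) occurs."""
    vec = [0] * n_bits
    for feature in features:
        vec[fold(feature, n_bits)] += 1
    return vec


# Hypothetical atom environments; "C-C" occurs twice.
features = ["C-O", "C-C", "C-C"]
bits = bit_vector(features, n_bits=16)
counts = count_vector(features, n_bits=16)
```

The count vector sums to the number of features, while the bit vector is the count vector clipped to 1.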
- class optunaz.descriptors.PathFP(name, parameters)[source]
Bases:
RdkitDescriptor
Path fingerprint based on RDKit FP Generator.
This is a Path fingerprint.
- class Parameters(maxPath=3, fpSize=2048)[source]
Bases:
object
- Parameters:
maxPath (int) – Maximum path for the fingerprint - minimum: 1, title: maxPath
fpSize (int) – Size of the fingerprint, sometimes also called bit size. - minimum: 1, title: fpSize
- maxPath = 3
- fpSize = 2048
- name
- parameters
- class optunaz.descriptors.MACCS_keys(name, parameters=MACCS_keys.Parameters())[source]
Bases:
RdkitDescriptor
MACCS
Molecular Access System (MACCS) fingerprint.
MACCS fingerprints (often referred to as MDL keys, after the developing company) are 166-bit 2D structure fingerprints calculated using keysets originally constructed and optimized for substructure searching (see Durant et al. Reoptimization of MDL keys for use in drug discovery).
Essentially, they are binary fingerprints (zeros and ones) that answer 166 fragment-related questions. If the explicitly defined fragment exists in the structure, the bit in that position is set to 1, and if not, it is set to 0. In that sense, the position of the bit matters because it is addressed to a specific question or fragment. An atom can belong to multiple MACCS keys, and since each bit is binary, MACCS 166 keys can represent more than 9.3×10^49 distinct fingerprint vectors.
- name
- parameters = MACCS_keys.Parameters()
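The count of distinct fingerprint vectors quoted above follows from 166 independent binary positions, i.e. 2^166. A short check using Python's arbitrary-precision integers:

```python
n_keys = 166

# Each key is either 0 or 1, so there are 2**166 possible vectors.
n_vectors = 2 ** n_keys

# Format in scientific notation for comparison with the figure in the text.
print(f"{n_vectors:.2e}")
```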
- class optunaz.descriptors.UnfittedSklearnScaler(mol_data: MolData = UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name: Literal['UnfittedSklearnScaler'] = 'UnfittedSklearnScaler')[source]
Bases:
object
- class MolData(file_path: pathlib.Path = None, smiles_column: str = None)[source]
Bases:
object
- file_path = None
- smiles_column = None
- mol_data = UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None)
- name = 'UnfittedSklearnScaler'
- class optunaz.descriptors.FittedSklearnScaler(saved_params: str, name: Literal['FittedSklearnScaler'] = 'FittedSklearnScaler')[source]
Bases:
object
- saved_params
- name = 'FittedSklearnScaler'
- class optunaz.descriptors.UnscaledMAPC(name, parameters)[source]
Bases:
RdkitDescriptor
Unscaled MAPC descriptors
These MAPC descriptors are unscaled and should be used with caution. MinHashed Atom-Pair Fingerprint Chiral (see Orsi et al. One chiral fingerprint to find them all) is the original version of the MinHashed Atom-Pair fingerprint of radius 2 (MAP4) which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.
These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.
- class Parameters(maxRadius=2, nPermutations=2048)[source]
Bases:
object
- Parameters:
maxRadius (int) – Maximum radius of the fingerprint. - minimum: 1, title: maxRadius
nPermutations (int) – Number of permutations to perform. - minimum: 1, title: nPermutations
- maxRadius = 2
- nPermutations = 2048
- name
- parameters
- class optunaz.descriptors.UnscaledPhyschemDescriptors(name='UnscaledPhyschemDescriptors', parameters=UnscaledPhyschemDescriptors.Parameters(rdkit_names=None))[source]
Bases:
RdkitDescriptor
Base (unscaled) PhyschemDescriptors (RDKit) for PhyschemDescriptors
These physchem descriptors are unscaled and should be used with caution. They are a set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).
Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.
- class Parameters(rdkit_names: Optional[List[str]] = None)[source]
Bases:
object
- rdkit_names = None
- name = 'UnscaledPhyschemDescriptors'
- parameters = UnscaledPhyschemDescriptors.Parameters(rdkit_names=None)
- class optunaz.descriptors.UnscaledJazzyDescriptors(name='UnscaledJazzyDescriptors', parameters=UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None))[source]
Bases:
MolDescriptor
Base (unscaled) Jazzy descriptors
These Jazzy descriptors are unscaled and should be used with caution. They offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: this descriptor employs a threshold of <50 Hydrogen acceptors/donors and a Mw of <1000Da for compound inputs.
- class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None)[source]
Bases:
object
- jazzy_names = None
- jazzy_filters = None
- name = 'UnscaledJazzyDescriptors'
- parameters = UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None)
- class optunaz.descriptors.UnscaledZScalesDescriptors(name='UnscaledZScalesDescriptors', parameters=UnscaledZScalesDescriptors.Parameters())[source]
Bases:
MolDescriptor
Unscaled Z-Scales.
Compute the Z-scales of a peptide SMILES. These Z-Scales descriptors are unscaled and should be used with caution.
- name = 'UnscaledZScalesDescriptors'
- parameters = UnscaledZScalesDescriptors.Parameters()
- class optunaz.descriptors.PrecomputedDescriptorFromFile(name, parameters)[source]
Bases:
MolDescriptor
Precomputed descriptors.
Users can supply a CSV file of feature vectors to use as descriptors, with headers on the first line. Each row corresponds to a compound in the training set, followed by a column that may have comma-separated vectors describing that molecule.
- class Parameters(file=None, input_column=None, response_column=None)[source]
Bases:
object
- Parameters:
file (str) – Name of the CSV containing precomputed descriptors - title: file
input_column (str) – Name of input column with SMILES strings - title: Input Column
response_column (str) – Name of response column with the comma-separated vectors that the model will use as pre-computed descriptors - title: Response column
- file = None
- input_column = None
- response_column = None
- name
- parameters
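A minimal sketch of the CSV layout described above, using only the standard library. The column names `smiles` and `fp`, the file contents, and the `load_precomputed` helper are assumptions for illustration; this is not how optunaz reads the file. Note that the vector column must be quoted because it contains commas.

```python
import csv
import io

# Hypothetical file contents: a header line, then one compound per row.
CSV_TEXT = '''smiles,fp
CCO,"0.1,0.2,0.3"
CCC,"1.0,0.0,0.5"
'''


def load_precomputed(handle, input_column, response_column):
    """Map each SMILES to its comma-separated vector parsed as floats."""
    table = {}
    for row in csv.DictReader(handle):
        vector = [float(x) for x in row[response_column].split(",")]
        table[row[input_column]] = vector
    return table


descriptors = load_precomputed(io.StringIO(CSV_TEXT), "smiles", "fp")
```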
- class optunaz.descriptors.SmilesFromFile(name, parameters=SmilesFromFile.Parameters())[source]
Bases:
MolDescriptor
Smiles as descriptors (for ChemProp).
ChemProp optimisation runs require either this or SmilesAndSideInfoFromFile descriptor to be selected. This setting allows the SMILES to pass through to the ChemProp package.
- name
- parameters = SmilesFromFile.Parameters()
- class optunaz.descriptors.SmilesAndSideInfoFromFile(name, parameters)[source]
Bases:
MolDescriptor
SMILES & side information descriptors (for ChemProp).
ChemProp optimisation requires either these or SmilesFromFile descriptors. This descriptor allows SMILES to pass through to ChemProp, _and_ for side information to be supplied as auxiliary tasks.
Side information can take the form of any vector (continuous or binary) which describe input compounds. All tasks are learnt in a multi-task manner to improve main-task (task of intent) predictions. Side information can boost performance since their contribution to network loss can lead to improved learnt molecular representations.
Optimal side information weighting (how much auxiliary tasks contribute to network loss) is also an (optional) learned parameter during optimisation.
Similar to PrecomputedDescriptorFromFile, CSV inputs for this descriptor should contain a SMILES column of input molecules. All vectors in the remaining columns are used as user-derived side information (i.e. be cautious to only upload a CSV with side information tasks in columns, since _all_ are used)
(see https://ruder.io/multi-task/index.html#auxiliarytasks for details).
- class Parameters(file=None, input_column=None, aux_weight_pc=SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20))[source]
Bases:
object
- Parameters:
file (str) – Name of the CSV containing precomputed side-info descriptors - title: file
input_column (str) – Name of input column with SMILES strings - title: Input Column
aux_weight_pc (Aux_Weight_Pc) – How much (%) auxiliary tasks (side information) contribute to the loss function optimised during training. The larger the number, the larger the weight of side information. - title: Auxiliary weight percentage
- class Aux_Weight_Pc(low: int = 100, high: int = 100, q: int = 20)[source]
Bases:
object
- low = 100
- high = 100
- q = 20
- file = None
- input_column = None
- aux_weight_pc = SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20)
- name
- parameters
- class optunaz.descriptors.ScaledDescriptor(parameters, name='ScaledDescriptor')[source]
Bases:
MolDescriptor
Scaled Descriptor.
This descriptor is not a complete descriptor, but instead it wraps and scales another descriptor.
Some algorithms require input to be within a certain range, e.g. [-1..1]. Some descriptors have different ranges for different columns/features. This descriptor wraps another descriptor and provides scaled values.
- class ScaledDescriptorParameters(descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors], scaler: Union[FittedSklearnScaler, UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'))[source]
Bases:
object
- descriptor
- scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
- parameters
- name = 'ScaledDescriptor'
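The idea of wrapping a descriptor and rescaling each column can be sketched as follows. The class names above suggest a scikit-learn scaler is used in practice; this example swaps in plain per-column min-max scaling to [-1, 1] so that it runs with the standard library only. `fit_min_max` and `transform` are hypothetical helper names.

```python
def fit_min_max(rows):
    """Learn (min, max) for every column of a list of feature vectors."""
    return [(min(col), max(col)) for col in zip(*rows)]


def transform(rows, bounds, low=-1.0, high=1.0):
    """Rescale every column linearly so its fitted range maps to [low, high]."""
    scaled = []
    for row in rows:
        out = []
        for value, (col_min, col_max) in zip(row, bounds):
            if col_max == col_min:
                # Constant column: place it at the midpoint of the target range.
                out.append((low + high) / 2)
            else:
                out.append(low + (value - col_min) * (high - low) / (col_max - col_min))
        scaled.append(out)
    return scaled


# Two features with very different ranges (e.g. a small index and a weight).
raw = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
bounds = fit_min_max(raw)
scaled = transform(raw, bounds)
```

Fitting on one data set and reusing `bounds` on another mirrors the split between an unfitted and a fitted scaler.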
- class optunaz.descriptors.PhyschemDescriptors(parameters=PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>), name='PhyschemDescriptors')[source]
Bases:
ScaledDescriptor
PhyschemDescriptors (scaled) calculated in RDKit
A set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).
Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.
- class Parameters(rdkit_names: Optional[List[str]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)[source]
Bases:
object
- rdkit_names = None
- scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
- descriptor
alias of
UnscaledPhyschemDescriptors
- parameters = PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)
- name = 'PhyschemDescriptors'
- class optunaz.descriptors.JazzyDescriptors(parameters=JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>), name='JazzyDescriptors')[source]
Bases:
ScaledDescriptor
Scaled Jazzy descriptors
Jazzy descriptors offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: Jazzy employs a threshold of <50 Hydrogen acceptors/donors and Mw of <1000Da for input compounds.
- class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)[source]
Bases:
object
- jazzy_names = None
- jazzy_filters = None
- scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
- descriptor
alias of
UnscaledJazzyDescriptors
- name = 'JazzyDescriptors'
- parameters = JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)
- class optunaz.descriptors.MAPC(parameters=MAPC.Parameters(maxRadius=2, nPermutations=2048, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledMAPC'>), name='MAPC')[source]
Bases:
ScaledDescriptor
Scaled MAPC descriptors
MAPC (MinHashed Atom-Pair Fingerprint Chiral) (see Orsi et al. One chiral fingerprint to find them all) is the original version of the MinHashed Atom-Pair fingerprint of radius 2 (MAP4) which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.
These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.
- class Parameters(maxRadius: int = 2, nPermutations: int = 2048, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledMAPC'>)[source]
Bases:
object
- maxRadius = 2
- nPermutations = 2048
- scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
- descriptor
alias of
UnscaledMAPC
- name = 'MAPC'
- parameters = MAPC.Parameters(maxRadius=2, nPermutations=2048, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledMAPC'>)
- class optunaz.descriptors.ZScalesDescriptors(parameters=ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>), name='ZScalesDescriptors')[source]
Bases:
ScaledDescriptor
Scaled Z-Scales descriptors.
Z-scales were proposed in Sandberg et al (1998) based on physicochemical properties of proteogenic and non-proteogenic amino acids, including NMR data and thin-layer chromatography (TLC) data. Refer to doi:10.1021/jm9700575 for the original publication. These descriptors capture 1. lipophilicity, 2. steric properties (steric bulk and polarizability), 3. electronic properties (polarity and charge), 4. electronegativity (heat of formation, electrophilicity and hardness) and 5. another electronegativity. This fingerprint is the computed average of Z-scales of all the amino acids in the peptide.
- class Parameters(scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)[source]
Bases:
object
- scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
- descriptor
alias of
UnscaledZScalesDescriptors
- name = 'ZScalesDescriptors'
- parameters = ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)
- class optunaz.descriptors.CompositeDescriptor(parameters, name='CompositeDescriptor')[source]
Bases:
MolDescriptor
Composite descriptor
Concatenates multiple descriptors into one. Select multiple algorithms from the button below. Please note the ChemProp SMILES descriptors are not compatible with this function.
- class Parameters(descriptors: List[Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors]])[source]
Bases:
object
- descriptors
- parameters
- name = 'CompositeDescriptor'
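Concatenation of several descriptors into one feature vector can be sketched as below. The two toy descriptor functions are hypothetical stand-ins that work on the SMILES string directly; real descriptors would be computed from the molecule.

```python
def atom_counts(smiles):
    """Toy descriptor: number of 'C' and 'O' characters in the SMILES string."""
    return [smiles.count("C"), smiles.count("O")]


def string_length(smiles):
    """Toy descriptor: length of the SMILES string."""
    return [len(smiles)]


def composite(descriptor_fns, smiles):
    """Concatenate the outputs of several descriptor functions, in order."""
    vector = []
    for fn in descriptor_fns:
        vector.extend(fn(smiles))
    return vector


features = composite([atom_counts, string_length], "CCO")
```

The resulting vector length is the sum of the individual descriptor lengths, and the column order follows the order of the descriptor list.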
- class optunaz.descriptors.CanonicalSmiles(name, parameters=CanonicalSmiles.Parameters())[source]
Bases:
MolDescriptor
Canonical Smiles for use in utility functions (not for user selection).
- name
- parameters = CanonicalSmiles.Parameters()
- class optunaz.descriptors.Scaffold(name, parameters=Scaffold.Parameters())[source]
Bases:
MolDescriptor
Scaffold Smiles for use in utility functions (not for user selection).
- name
- parameters = Scaffold.Parameters()
- class optunaz.descriptors.GenericScaffold(name, parameters=GenericScaffold.Parameters())[source]
Bases:
MolDescriptor
Generic Scaffold Smiles for use in utility functions (not for user selection).
- name
- parameters = GenericScaffold.Parameters()
- class optunaz.descriptors.ValidDescriptor(name, parameters=ValidDescriptor.Parameters())[source]
Bases:
MolDescriptor
Validates Smiles for use in utility functions (not for user selection).
- name
- parameters = ValidDescriptor.Parameters()
- optunaz.descriptors.descriptor_from_config(smiles, descriptor, cache=None, return_failed_idx=True)[source]
Returns molecular descriptors (fingerprints) for a given set of SMILES and configuration.
When return_failed_idx is True, this returns a 2D NumPy array and the valid indices for that descriptor. When return_failed_idx is False, this returns the raw descriptor output (e.g. for canonical SMILES).
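The two return modes can be illustrated with a simplified sketch. It uses plain lists instead of NumPy arrays, and `describe_all` and `toy_descriptor` are hypothetical names; the toy descriptor returns None for an empty string to mimic a SMILES that fails.

```python
def toy_descriptor(smiles):
    """Toy descriptor: counts of 'C' and 'O'; None marks a failed input."""
    if not smiles:
        return None
    return [smiles.count("C"), smiles.count("O")]


def describe_all(smiles_list, descriptor_fn, return_failed_idx=True):
    """Compute descriptors for every input, optionally filtering out failures."""
    results = [descriptor_fn(s) for s in smiles_list]
    if not return_failed_idx:
        # Raw output, one entry per input (failures stay as None).
        return results
    valid_idx = [i for i, r in enumerate(results) if r is not None]
    matrix = [results[i] for i in valid_idx]
    return matrix, valid_idx


matrix, valid_idx = describe_all(["CCO", "", "CC"], toy_descriptor)
raw = describe_all(["CCO", "", "CC"], toy_descriptor, return_failed_idx=False)
```

The valid indices let the caller align the filtered descriptor rows with the corresponding responses.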
optunaz.evaluate module
optunaz.explainability module
- optunaz.explainability.get_ecfp_fpinfo(m, descriptor)[source]
Return the ecfp info for a compound mol
- optunaz.explainability.get_ecfpcount_fpinfo(m, descriptor)[source]
Return the ecfp_count info for a compound mol
- optunaz.explainability.explain_ECFP(len_feats, estimator, descriptor)[source]
Explain ECFPs using train atom environments
- optunaz.explainability.get_fp_info(exp_df, estimator, descript, fp_idx, strt_idx=None)[source]
Get ECFP SMILES environments or Physchem names when available
- optunaz.explainability.runShap(estimator, X_pred, mode)[source]
Explain model prediction using auto explainer or SHAP KernelExplainer
optunaz.metircs module
- optunaz.metircs.validate_cls_input(y_true, y_pred)[source]
Validate true and predicted arrays for metrics.
- optunaz.metircs.auc_pr_cal(y_true, y_pred, pi_zero=0.1)[source]
Compute calibrated AUC PR metric.
Implemented according to MELLODDY SparseChem https://github.com/melloddy/SparseChem. Calibration modifies the AUC PR to account for class imbalance.
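The calibration idea can be sketched as rescaling each precision value from the observed positive rate to the reference rate pi_zero. The snippet below is a minimal illustration derived from the definition of precision; it is not the QSARtuna or SparseChem source, and the function name calibrated_precision and its pos_rate argument are hypothetical.

```python
def calibrated_precision(precision, pos_rate, pi_zero=0.1):
    """Rescale a precision value from class prior `pos_rate` to prior `pi_zero`.

    Hypothetical helper for illustration; `pos_rate` must lie strictly in (0, 1).
    """
    if precision == 0.0:
        return 0.0  # no true positives at this threshold, whatever the prior
    # (1/precision - 1) is the false-positive to true-positive ratio at the observed prior.
    fp_per_tp = 1.0 / precision - 1.0
    # Re-weight that ratio by the odds of the observed prior relative to the reference prior.
    prior_ratio = (pos_rate * (1.0 - pi_zero)) / ((1.0 - pos_rate) * pi_zero)
    return 1.0 / (1.0 + fp_per_tp * prior_ratio)


# A balanced data set (50% actives) with precision 0.5, re-expressed at a 10% prior.
print(calibrated_precision(0.5, pos_rate=0.5, pi_zero=0.1))
```

Under these assumptions, applying the rescaling to every point of the precision-recall curve and integrating over recall would give a calibrated AUC PR that is comparable across tasks with different class balance.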
- optunaz.metircs.bedroc_score(y_true, y_pred, alpha=20.0)[source]
Compute BEDROC metric.
Implemented according to Truchon, J. & Bayly, C.I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. J. Chem. Inf. Model. 47, 488-508 (2007).
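For orientation, the BEDROC definition from the cited paper can be sketched in plain Python as follows. This is an illustrative reimplementation, not the optunaz code: the name bedroc_sketch is hypothetical, labels are assumed to be binary with 1 marking actives, and ties in y_pred are broken by input order.

```python
import math


def bedroc_sketch(y_true, y_pred, alpha=20.0):
    """BEDROC for binary labels (1 = active), ranking by descending score."""
    # Rank compounds from highest to lowest predicted score (rank 1 is best).
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    n_total = len(order)
    active_ranks = [rank for rank, i in enumerate(order, start=1) if y_true[i] == 1]
    n_active = len(active_ranks)
    if n_active == 0 or n_active == n_total:
        raise ValueError("need at least one active and one inactive compound")
    ratio = n_active / n_total
    # Robust initial enhancement (RIE): exponentially weighted sum over active ranks,
    # normalised by its expected value under a random ranking.
    weighted = sum(math.exp(-alpha * r / n_total) for r in active_ranks)
    random_mean = (1.0 / n_total) * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    rie = weighted / (n_active * random_mean)
    # Rescale RIE between its worst and best achievable values so the score lies in [0, 1].
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)
```

Larger alpha values weight the top of the ranked list more heavily, which is why the metric is used for early-recognition problems.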
- optunaz.metircs.concordance_index(y_true, y_pred)[source]
Compute Concordance index.
Statistical metric that indicates the quality of a predicted ranking, based on Steck, H., et al. “On ranking in survival analysis: Bounds on the concordance index.” Advances in neural information processing systems (2008): 1209-1216.
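A pairwise view of the concordance index may help when reading scores. The sketch below uses one common definition (the fraction of comparable pairs whose predicted order matches the true order, with tied predictions counted as half); the tie handling and the name concordance_index_sketch are assumptions and may differ from the optunaz implementation.

```python
def concordance_index_sketch(y_true, y_pred):
    """Fraction of comparable pairs ranked in the same order by y_true and y_pred."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with tied true values carry no ranking information
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5  # assumption: a tied prediction counts as half
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("all true values are tied; the index is undefined")
    return concordant / comparable
```

With this definition a perfect ranking scores 1.0, a fully reversed ranking scores 0.0, and a constant prediction scores 0.5. The double loop is quadratic in the number of samples, which is acceptable for illustration only.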
optunaz.model_writer module
- class optunaz.model_writer.Predictor[source]
Bases:
ABC
Interface definition for scikit-learn/chemprop Predictor.
Scikit-learn does not define a class that describes the Predictor interface. Instead, scikit-learn describes in text that Predictor should have method ‘predict’, and optionally ‘predict_proba’: https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects
This class describes this interface as an abstract Python class, for convenience and better type checking.
- predict_proba(data)[source]
For Classification algorithms, returns algorithmic posterior of a prediction.
This method is optional, and is not marked with @abstractmethod.
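The shape of such an interface can be sketched with Python's abc module. The class below is an illustration rather than the optunaz source: PredictorSketch and MeanRegressor are hypothetical names, predict is assumed to be the abstract method, and raising NotImplementedError from the default predict_proba is an assumption.

```python
from abc import ABC, abstractmethod


class PredictorSketch(ABC):
    """Illustrative stand-in for a scikit-learn style predictor interface."""

    @abstractmethod
    def predict(self, data):
        """Return one prediction per row of `data`; subclasses must implement this."""

    def predict_proba(self, data):
        # Optional: deliberately not abstract, so regressors can leave it out.
        raise NotImplementedError("predict_proba is only available for classifiers")


class MeanRegressor(PredictorSketch):
    """Toy regressor that always predicts a fixed value."""

    def __init__(self, mean):
        self.mean = mean

    def predict(self, data):
        return [self.mean for _ in data]


model = MeanRegressor(2.0)
print(model.predict([[0.0], [1.0]]))
```

Because only predict is abstract in this sketch, a subclass that omits predict_proba can still be instantiated, while instantiating the base class directly raises TypeError.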
- class optunaz.model_writer.QSARtunaModel(predictor: Predictor, descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors, CompositeDescriptor, SmilesFromFile, SmilesAndSideInfoFromFile], mode: ModelMode, transform: Optional[ModelDataTransform] = None, aux_transform: Union[VectorFromColumn, ZScales, AmorProt, NoneType] = None, metadata: Optional[Dict] = None)[source]
Bases:
ABC
- predictor
- descriptor
- mode
- transform = None
- aux_transform = None
- metadata = None
- predict_from_smiles(smiles, aux=None, uncert=False, explain=False, transform='default', aux_transform=None)[source]
Returns model predictions for the input SMILES strings.
If an input SMILES is invalid for the descriptor, the descriptor returns None for it. Those None values are not sent to the model; instead, NaN is returned as the predicted value for each invalid SMILES.
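The NaN-filling behaviour described above can be sketched independently of the library. Everything in this example is hypothetical (predict_with_nan_fill, toy_featurize, and toy_predict are stand-ins, and the validity check is deliberately crude); it only illustrates the pattern of predicting on valid rows and writing the results back into their original positions.

```python
import math


def predict_with_nan_fill(smiles, featurize, predict):
    """Predict only for inputs the descriptor accepts; return NaN for the rest."""
    features = [featurize(s) for s in smiles]
    valid_idx = [i for i, f in enumerate(features) if f is not None]
    predictions = [math.nan] * len(smiles)
    if valid_idx:
        # Only valid feature rows are sent to the model.
        valid_preds = predict([features[i] for i in valid_idx])
        for i, p in zip(valid_idx, valid_preds):
            predictions[i] = p
    return predictions


def toy_featurize(s):
    # Stand-in descriptor: treat unbalanced parentheses as an invalid SMILES.
    return None if s.count("(") != s.count(")") else [len(s)]


def toy_predict(rows):
    # Stand-in model: the "prediction" is just the single feature value.
    return [float(r[0]) for r in rows]


result = predict_with_nan_fill(["CCO", "C(C", "c1ccccc1"], toy_featurize, toy_predict)
print(result)
```

Keeping the output the same length as the input means callers can align predictions with their original records without tracking which inputs failed.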
- optunaz.model_writer.get_metadata(buildconfig, train_scores, test_scores)[source]
Metadata for a predictive model.
optunaz.objective module
optunaz.optbuild module
- optunaz.optbuild.build_with_al(model_path, inference_path, mode)[source]
Active learning inference, which can occur alongside building.
optunaz.predict module
- exception optunaz.predict.ArgsError[source]
Bases:
Exception
Thrown when there is an issue with basic args at inference time
- exception optunaz.predict.UncertaintyError[source]
Bases:
Exception
Thrown when uncertainty parameters are not set correctly at inference
- exception optunaz.predict.AuxCovariateMissing[source]
Bases:
Exception
Thrown when a model is trained using Auxiliary (covariate) data which is not supplied at inference
optunaz.schemagen module
- optunaz.schemagen.type_base_schema(tp)[source]
Adds title and description from docstrings.
See https://wyfo.github.io/apischema/0.16/json_schema/#base-schema
optunaz.three_step_opt_build_merge module
- optunaz.three_step_opt_build_merge.split_optimize(optconfig)[source]
Split Hyperparameter runs into non-chemprop and chemprop runs for Optuna.
- optunaz.three_step_opt_build_merge.base_chemprop_params(alg)[source]
Used to enqueue an initial ChemProp run that captures sensible defaults as defined by the original authors. A check ensures that any parameters outside the valid Optuna subspace are popped from the fixed parameters.
- optunaz.three_step_opt_build_merge.run_study(optconfig, study_name, objective, n_startup_trials, n_trials, seed, storage=True, trial_number_offset=0)[source]
Run an Optuna study
- optunaz.three_step_opt_build_merge.optimize(optconfig, study_name=None)[source]
Step 1. Hyperparameter optimization using Optuna.