optunaz package

Subpackages

Submodules

optunaz.automl module

class optunaz.automl.ModelAutoML(output_path=None, input_data=None, n_cores=-1, email=None, user_name=None, smiles_col=None, activity_col=None, task_col=None, dry_run=False, timestr='20240828-174523')[source]

Bases: object

Prepares data for model training with ModelDispatcher. ModelAutoML also stores activity data for new tasks until enough data is available.

property first_run
property processed_timepoints
property last_timepoint
getAllRetrainingData()[source]

Returns a dict of the wildcard data with converted datetimes as the keys

getRetrainingData()[source]

Get data for the latest unprocessed date bucket or raise NoNewRetrainingData if none

setRetrainingData()[source]

Sets the newest data bucket and timepoint for latest available data

initProcessedTimepoints()[source]

Initialise the JSON containing timepoints for a first run

setProcessedTimepoints(problem=None)[source]

Set the processed timepoints and the currently processing timepoint to JSON
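The timepoint bookkeeping described by the methods above can be pictured with a minimal sketch. The JSON layout, the file name, and the helper names are assumptions made for illustration; they are not taken from the ModelAutoML source.

```python
import json
import tempfile
from pathlib import Path


class NoNewRetrainingData(Exception):
    """Raised when every available date bucket has already been processed."""


def init_processed_timepoints(path):
    # First run: no timepoints processed, nothing currently processing.
    path.write_text(json.dumps({"processed": [], "processing": None}))


def next_unprocessed(path, available):
    # Return the latest bucket that is not yet recorded as processed.
    state = json.loads(path.read_text())
    pending = sorted(set(available) - set(state["processed"]))
    if not pending:
        raise NoNewRetrainingData("no unprocessed date bucket")
    return pending[-1]


def mark_processed(path, timepoint):
    # Record the timepoint and clear the "currently processing" slot.
    state = json.loads(path.read_text())
    state["processed"].append(timepoint)
    state["processing"] = None
    path.write_text(json.dumps(state))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "processed_timepoints.json"  # hypothetical name
    init_processed_timepoints(state_file)
    buckets = ["2024-01-01", "2024-02-01"]
    first = next_unprocessed(state_file, buckets)
    mark_processed(state_file, first)
    second = next_unprocessed(state_file, buckets)
```

ISO-formatted date strings sort chronologically, which is why a plain `sorted` call is enough in this sketch.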

class optunaz.automl.ModelDispatcher(quorum=None, cfg=None, last_timepoint=None, initial_template=None, retrain_template=None, slurm_template=None, slurm_req_cores=1, slurm_req_partition=None, slurm_req_mem=None, slurm_al_pool=None, slurm_al_smiles=None, slurm_job_prefix=None, slurm_partition=None, save_previous_models=None, log_conf=None)[source]

Bases: object

Use ModelAutoML config as a basis to prepare QSARtuna jobs, dispatching to SLURM. ModelDispatcher always needs a quorum to prepare the model

property pretrained_model

Load a pretrained model

checkIfRetrainingProcessed(taskcode)[source]

Checks if this timepoint has already been predicted (and therefore processed). Timepoints to be skipped with data but no model quorum will also be in .skipped dirs.

checkisLocked(taskcode)[source]

Checks if this timepoint is locked for a given taskcode. Locks occur if QSARtuna is unable to run multiple instances of the retrain script.

checkRunningSlurmJobs()[source]
static calcSlurmMem(len_file)[source]

Dynamic resource allocation for memory from query

setDispatcherVariables(taskcode)[source]

Sets environment variables on a per taskcode level

setJobLocked()[source]

Creates lock file to ensure future runs do not overwrite pending jobs
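A lock file of this kind could look roughly like the following. This is a sketch only: the `.locked` file name and both helper functions are hypothetical, and the real ModelDispatcher may store its locks differently.

```python
import tempfile
from pathlib import Path


def set_job_locked(task_dir):
    """Create a lock file; return False if one already exists."""
    lock = task_dir / ".locked"  # hypothetical file name
    try:
        # Mode "x" raises FileExistsError if the file is already there.
        with open(lock, "x"):
            pass
        return True
    except FileExistsError:
        return False


def is_locked(task_dir):
    return (task_dir / ".locked").exists()


with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp)
    locked_before = is_locked(task_dir)
    first_attempt = set_job_locked(task_dir)
    second_attempt = set_job_locked(task_dir)  # a later run finds the lock
    locked_after = is_locked(task_dir)
```

Creating the file in exclusive mode means a second run detects the pending job instead of overwriting it.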

processTrain(_taskcode_df)[source]

Opens the existing training data if possible, formats the data, and sets attributes for the previous data. If there is no retraining, creates the directory. Returns the new SMILES and y values for training.

processQuorum(_input_df)[source]

Evaluates quorum & formats retraining data

isTrained()[source]
checkSaveTemporalModel()[source]
doTemporalPredictions(new_data)[source]

Start/check temporal (pseudo-prospective) predictions with an old QSARtuna model vs. newest data

writeSlurm()[source]

Writes a slurm job for a QSARtuna run for a given taskcode

writeJson()[source]

Writes a QSARtuna json for a given taskcode

writeDataset(out_df)[source]

Writes the training datapoints to file

setSkippedTimepoint()[source]

Annotate the timepoint as not eligible for a taskcode

checkSkipped()[source]
submitJob()[source]
checkSlurmStatusAndNextProcedure()[source]

Check a SLURM job completed with no cancellations

increaseJobTime(minutes)[source]

Increase SLURM model time

increaseJobMem(mem, max_mem=200)[source]

Increase SLURM model memory

increaseJobCpu(cpu, max_cpu=20)[source]

Increase SLURM model cpu

addSlurmRetry()[source]
getSlurmRetry()[source]
resubmitAnyFailedJobs(locked_jobs, minutes=720, mem=20, cpu=4, max_retries=5, max_mem=200, max_cpu=20)[source]

Resubmit failed jobs, according to reason for failure
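One way to read "according to reason for failure" is sketched below, using the default increments and caps from the signatures above. The mapping from failure reason to resource, the additive increments, the gigabyte unit, and the `JobRequest` class are all assumptions; `TIMEOUT` and `OUT_OF_MEMORY` are SLURM job state names, but how the dispatcher interprets them is not documented here. CPU escalation is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    minutes: int
    mem_gb: int
    retries: int = 0


def escalate(job, reason, minutes=720, mem=20, max_retries=5, max_mem=200):
    """Bump the resource that caused the failure; return False when out of retries."""
    if job.retries >= max_retries:
        return False
    if reason == "TIMEOUT":
        job.minutes += minutes
    elif reason == "OUT_OF_MEMORY":
        job.mem_gb = min(job.mem_gb + mem, max_mem)  # never exceed the cap
    job.retries += 1
    return True


job = JobRequest(minutes=720, mem_gb=190)
outcomes = [escalate(job, "OUT_OF_MEMORY"), escalate(job, "TIMEOUT")]
# Keep retrying until the retry budget is exhausted.
while escalate(job, "OUT_OF_MEMORY"):
    pass
```

Capping each resource and counting retries keeps a repeatedly failing job from requesting unbounded resources.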

processRetraining(taskcode)[source]

Enumerates through new data, creating the latest files and models

optunaz.automl.process_retraining_task(taskcode, dispatcher)[source]
optunaz.automl.dispatcher_process(global_cfg, args, dispatcher)[source]
optunaz.automl.meta()[source]

Tracks temporal performance of QSARtuna models by writing the metadata to JSON files

optunaz.automl.validate_args(args)[source]
optunaz.automl.prepare_dispatcher(global_cfg, args, log_conf)[source]
optunaz.automl.main()[source]

optunaz.builder module

optunaz.builder.build(buildconfig, merge_train_and_test_data=False, cache=None)[source]

Build regressor or classifier model and return it.

optunaz.datareader module

optunaz.datareader.isvalid(smiles)[source]
optunaz.datareader.read_data(filename, smiles_col='smiles', resp_col=None, response_type=None, aux_col=None, split_strategy=None)[source]

Reads data, drops NaNs and invalid SMILES. Supports SDF and CSV formats. In case of SDF - only response column has to be provided, since the smiles will be parsed from mol files inside.

Returns a tuple of ( SMILES (X), responses (Y), groups (groups) ).

optunaz.datareader.deduplicate(smiles, y, aux, groups, deduplication_strategy, response_type)[source]

Removes duplicates based on RDKit canonical SMILES representation.

Returns a 2-tuple of original SMILES and deduplicated values.

In case there is an ambiguity which SMILES to return, as is the case for deduplication by averaging, returns canonical SMILES instead.
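Deduplication by averaging can be sketched as follows. The function name and the lookup-table canonicaliser are stand-ins for illustration; the real function relies on RDKit canonical SMILES and supports other strategies that are not shown here.

```python
from collections import defaultdict
from statistics import mean


def deduplicate_by_average(smiles, y, canonicalize):
    """Group values by canonical key and average each group."""
    groups = defaultdict(list)
    for smi, value in zip(smiles, y):
        groups[canonicalize(smi)].append(value)
    keys = list(groups)  # dicts preserve insertion order
    return keys, [mean(groups[k]) for k in keys]


# Stand-in canonicaliser for the demo only; real code would use RDKit.
toy_canonical = {"OCC": "CCO", "CCO": "CCO", "c1ccccc1": "c1ccccc1"}.get

dedup_smiles, dedup_y = deduplicate_by_average(
    ["CCO", "OCC", "c1ccccc1"], [1.0, 3.0, 5.0], toy_canonical
)
```

Because the two ethanol spellings are merged into one averaged value, neither original string is the obvious one to return, which is why the canonical form is returned in that case.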

optunaz.datareader.split(X, y, aux, strategy, groups)[source]
optunaz.datareader.merge(train_smiles, train_y, train_aux, test_smiles, test_y, test_aux)[source]
optunaz.datareader.transform(smiles_, y_, aux_, transform)[source]
class optunaz.datareader.Dataset(training_dataset_file, input_column, response_column, response_type=None, aux_column=None, aux_transform=None, deduplication_strategy=<factory>, split_strategy=<factory>, test_dataset_file=None, save_intermediate_files=False, intermediate_training_dataset_file=None, intermediate_test_dataset_file=None, log_transform=False, log_transform_base=None, log_transform_negative=None, log_transform_unit_conversion=None, probabilistic_threshold_representation=False, probabilistic_threshold_representation_threshold=None, probabilistic_threshold_representation_std=None)[source]

Bases: object

Dataset.

Holds training data, optional test data, and names of input and response columns.

training_dataset_file
input_column
response_column
response_type = None
aux_column = None
aux_transform = None
deduplication_strategy
split_strategy
test_dataset_file = None
save_intermediate_files = False
intermediate_training_dataset_file = None
intermediate_test_dataset_file = None
log_transform = False
log_transform_base = None
log_transform_negative = None
log_transform_unit_conversion = None
probabilistic_threshold_representation = False
probabilistic_threshold_representation_threshold = None
probabilistic_threshold_representation_std = None
get_sets()[source]

Returns training and test datasets.

get_merged_sets()[source]

Returns merged training+test datasets.

check_sets()[source]

Check sets are valid

optunaz.descriptors module

exception optunaz.descriptors.ScalingFittingError(descriptor_str=None)[source]

Bases: Exception

Raised when there are insufficient molecules for UnfittedSklearnScaler to fit

exception optunaz.descriptors.NoValidSmiles[source]

Bases: Exception

Raised when no valid SMILES are available

optunaz.descriptors.mol_from_smi(smi)[source]
optunaz.descriptors.numpy_from_rdkit(fp, dtype)[source]

Returns Numpy representation of a given RDKit Fingerprint.

class optunaz.descriptors.MolDescriptor[source]

Bases: NameParameterDataclass, ABC

Molecular Descriptors.

Descriptors can be fingerprints, but can also be custom user-specified descriptors.

Descriptors calculate a feature vector that will be used as input for predictive models (e.g. scikit-learn models).

abstract calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

parallel_compute_descriptor(smiles, n_cores=None, cache=None)[source]

Use python Parallel to compute descriptor (e.g. a fingerprint) for a given SMILES string.

Can be used to generate descriptors in parallel and/or with a cache
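The combination of parallel computation and caching can be illustrated with the standard library. This is a sketch under stated assumptions: `ThreadPoolExecutor` and a plain dict stand in for whatever parallel backend and cache object the package actually uses, and `toy_descriptor` is a made-up function.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_compute(calculate, smiles, n_workers=2, cache=None):
    """Apply `calculate` to each SMILES, reusing cached results when present."""
    cache = {} if cache is None else cache
    # Compute each distinct, uncached SMILES only once.
    missing = [s for s in dict.fromkeys(smiles) if s not in cache]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for smi, desc in zip(missing, pool.map(calculate, missing)):
            cache[smi] = desc
    return [cache[s] for s in smiles]


calls = []


def toy_descriptor(smi):
    calls.append(smi)  # record how often the descriptor is really computed
    return [len(smi), smi.count("C")]


shared_cache = {}
first = parallel_compute(toy_descriptor, ["CCO", "CC", "CCO"], cache=shared_cache)
second = parallel_compute(toy_descriptor, ["CC"], cache=shared_cache)
```

The second call is served entirely from the cache, so the descriptor function runs only twice across both calls.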

class optunaz.descriptors.RdkitDescriptor[source]

Bases: MolDescriptor, ABC

Abstract class for RDKit molecular descriptors (fingerprints).

abstract calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

calculate_from_smi(smi)[source]

Returns a descriptor (fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

Returns None if input SMILES string is not valid according to RDKit.
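The relationship between calculate_from_smi and calculate_from_mol can be sketched as a small template-method pattern. Everything below is a toy: `toy_mol_from_smi` only checks characters and stands in for RDKit parsing, and `ToyDescriptor` and `CarbonCount` are hypothetical classes, not part of optunaz.

```python
from abc import ABC, abstractmethod

# Characters the toy parser accepts; whitespace is deliberately excluded.
ALLOWED = set("BCNOPSFIHclbrnops0123456789()[]=#@+-/\\%.")


def toy_mol_from_smi(smi):
    """Return the string as a 'molecule', or None if it looks invalid."""
    if smi and all(ch in ALLOWED for ch in smi):
        return smi
    return None


class ToyDescriptor(ABC):
    @abstractmethod
    def calculate_from_mol(self, mol):
        """Return a feature vector for an already-parsed molecule."""

    def calculate_from_smi(self, smi):
        mol = toy_mol_from_smi(smi)
        if mol is None:
            return None  # invalid input yields None rather than raising
        return self.calculate_from_mol(mol)


class CarbonCount(ToyDescriptor):
    def calculate_from_mol(self, mol):
        return [mol.count("C") + mol.count("c")]


desc = CarbonCount()
valid = desc.calculate_from_smi("c1ccccc1")
invalid = desc.calculate_from_smi("not a smiles")
```

Callers therefore need to filter out None results before stacking descriptors into a matrix.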

class optunaz.descriptors.AmorProtDescriptors(name, parameters)[source]

Bases: MolDescriptor

These descriptors are intended to be used with Peptide SMILES

class AmorProt(maccs=True, ecfp4=True, ecfp6=True, rdkit=True, W=10, A=10, R=0.85)

Bases: object

T(fp, p, W=10, A=10, R=0.85)
fingerprint(seq)
class Parameters[source]

Bases: object

name
parameters
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.Avalon(name, parameters)[source]

Bases: RdkitDescriptor

Avalon Descriptor

Avalon (see Gedeck P, et al. QSAR-how good is it in practice?) uses a fingerprint generator in a similar way to Daylight fingerprints, but enumerates with custom feature classes of the molecular graph (see the reference paper for the 16 feature classes used). Hash codes for the path-style features are computed implicitly during enumeration. Avalon generated the largest number of good models in the reference study, which is likely because the fingerprint generator was tuned toward the features contained in the data set.

class Parameters(nBits=2048)[source]

Bases: object

Parameters:

nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits

nBits = 2048
name
parameters
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.ECFP(name, parameters)[source]

Bases: RdkitDescriptor

Binary Extended Connectivity Fingerprint (ECFP).

ECFP (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints], are built by applying the Morgan algorithm to a set of user-supplied atom invariants. This approach (implemented here using GetMorganFingerprintAsBitVect from RDKit) systematically records the neighborhood of each non-H atom into multiple circular layers up to a given radius (provided at runtime). The substructural features are mapped to integers using a hashing procedure (length of the hash provided at runtime). It is the set of the resulting identifiers that defines ECFPs. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
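The radius-versus-diameter naming convention is easy to get wrong, so here is a small helper that makes it explicit. The function `fingerprint_label` is hypothetical and exists only to illustrate the arithmetic.

```python
def fingerprint_label(radius):
    """Name an ECFP by the diameter of its atom environments."""
    if radius < 1:
        raise ValueError("radius must be at least 1")
    return f"ECFP{2 * radius}"


labels = [fingerprint_label(r) for r in (1, 2, 3)]
```

With the default radius of 3 documented below, the resulting fingerprint corresponds to what the literature calls ECFP6.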

class Parameters(radius=3, nBits=2048, returnRdkit=False)[source]

Bases: object

Parameters:
  • radius (int) – Radius of the atom environments considered. Note that the 4 in ECFP4 corresponds to the diameter of the atom environments considered, while here we use radius. For example, radius=2 would correspond to ECFP4. - minimum: 1, title: radius

  • nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits

  • returnRdkit (bool) –

radius = 3
nBits = 2048
returnRdkit = False
name
parameters
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.ECFP_counts(name, parameters)[source]

Bases: RdkitDescriptor

ECFP With Counts

Binary Extended Connectivity Fingerprint (ECFP) With Counts.

ECFP (see Rogers et al. “Extended-Connectivity Fingerprints.”) [also known as Circular Fingerprints or Morgan Fingerprints] With Counts are built similarly to ECFP fingerprints; however, this approach (implemented using GetHashedMorganFingerprint from RDKit) systematically records count vectors rather than bit vectors. Bit vectors track whether features appear in a molecule, while count vectors track the number of times each feature appears. The diameter of the atom environments is appended to the name (e.g. ECFP4 corresponds to radius=2).
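The difference between bit vectors and count vectors can be shown with a toy folding function. This is not the Morgan algorithm: the feature strings are made up and `zlib.crc32` is an arbitrary stand-in for RDKit's hashing.

```python
import zlib


def fold(features, n_bits, counts=False):
    """Hash feature identifiers into a fixed-length vector."""
    vec = [0] * n_bits
    for feat in features:
        idx = zlib.crc32(feat.encode()) % n_bits
        # Count vectors accumulate occurrences; bit vectors only flag presence.
        vec[idx] = vec[idx] + 1 if counts else 1
    return vec


features = ["C-O", "C-C", "C-C", "C-C"]  # hypothetical substructure identifiers
bit_vec = fold(features, n_bits=8)
count_vec = fold(features, n_bits=8, counts=True)
```

The count vector preserves the fact that one feature occurs three times, whereas the bit vector records only that it is present.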

class Parameters(radius=3, useFeatures=True, nBits=2048)[source]

Bases: object

Parameters:
  • radius (int) – Radius of the atom environments considered. For ECFP4 (diameter=4) set radius=2 - minimum: 1, title: radius

  • useFeatures (bool) – Use feature fingerprints (FCFP), instead of normal ones (ECFP). RDKit feature definitions are adapted from the definitions in Gobbi & Poppinger, Biotechnology and Bioengineering 61, 47-54 (1998). FCFP and ECFP will likely lead to different fingerprints/similarity scores. - title: useFeatures

  • nBits (int) – Number of bits in the fingerprint, sometimes also called size. - minimum: 1, title: nBits

radius = 3
useFeatures = True
nBits = 2048
name
parameters
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.PathFP(name, parameters)[source]

Bases: RdkitDescriptor

Path fingerprint based on RDKit FP Generator.

This is a Path fingerprint.

class Parameters(maxPath=3, fpSize=2048)[source]

Bases: object

Parameters:
  • maxPath (int) – Maximum path for the fingerprint - minimum: 1, title: maxPath

  • fpSize (int) – Size of the fingerprint, sometimes also called bit size. - minimum: 1, title: fpSize

maxPath = 3
fpSize = 2048
name
parameters
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.MACCS_keys(name, parameters=MACCS_keys.Parameters())[source]

Bases: RdkitDescriptor

MACCS

Molecular Access System (MACCS) fingerprint.

MACCS fingerprints (often referred to as MDL keys, after the developing company) are 166-bit 2D structure fingerprints calculated using keysets originally constructed and optimized for substructure searching (see Durant et al. Reoptimization of MDL keys for use in drug discovery).

Essentially, they are a binary fingerprint (zeros and ones) that answer 166 fragment related questions. If the explicitly defined fragment exists in the structure, the bit in that position is set to 1, and if not, it is set to 0. In that sense, the position of the bit matters because it is addressed to a specific question or a fragment. An atom can belong to multiple MACCS keys, and since each bit is binary, MACCS 166 keys can represent more than 9.3×10^49 distinct fingerprint vectors.
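The size of that vector space follows directly from having 166 independent binary positions, that is, 2^166 combinations. A quick check in Python, with illustrative variable names:

```python
n_keys = 166
n_vectors = 2 ** n_keys  # every key is either 0 or 1

# Scientific notation for readability.
print(f"{n_vectors:.2e}")
```

This is an upper bound on distinct bit patterns; it says nothing about how many of them correspond to real molecules.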

class Parameters[source]

Bases: object

name
parameters = MACCS_keys.Parameters()
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.UnfittedSklearnScaler(mol_data: MolData = UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name: Literal['UnfittedSklearnScaler'] = 'UnfittedSklearnScaler')[source]

Bases: object

class MolData(file_path: pathlib.Path = None, smiles_column: str = None)[source]

Bases: object

file_path = None
smiles_column = None
mol_data = UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None)
name = 'UnfittedSklearnScaler'
get_fitted_scaler_for_fp(fp, cache=None)[source]
class optunaz.descriptors.FittedSklearnScaler(saved_params: str, name: Literal['FittedSklearnScaler'] = 'FittedSklearnScaler')[source]

Bases: object

saved_params
name = 'FittedSklearnScaler'
get_fitted_scaler()[source]
class optunaz.descriptors.UnscaledMAPC(name, parameters)[source]

Bases: RdkitDescriptor

Unscaled MAPC descriptors

These MAPC descriptors are unscaled and should be used with caution. MinHashed Atom-Pair Fingerprint Chiral (see Orsi et al. One chiral fingerprint to find them all) is the original version of the MinHashed Atom-Pair fingerprint of radius 2 (MAP4) which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.

These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.

class Parameters(maxRadius=2, nPermutations=2048)[source]

Bases: object

Parameters:
  • maxRadius (int) – Maximum radius of the fingerprint. - minimum: 1, title: maxRadius

  • nPermutations (int) – Number of permutations to perform. - minimum: 1, title: nPermutations

maxRadius = 2
nPermutations = 2048
name
parameters
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.UnscaledPhyschemDescriptors(name='UnscaledPhyschemDescriptors', parameters=UnscaledPhyschemDescriptors.Parameters(rdkit_names=None))[source]

Bases: RdkitDescriptor

Base (unscaled) PhyschemDescriptors (RDKit) for PhyschemDescriptors

These physchem descriptors are unscaled and should be used with caution. They are a set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).

Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.

class Parameters(rdkit_names: Optional[List[str]] = None)[source]

Bases: object

rdkit_names = None
name = 'UnscaledPhyschemDescriptors'
parameters = UnscaledPhyschemDescriptors.Parameters(rdkit_names=None)
calculate_from_mol(mol)[source]

Returns a descriptor (fingerprint) for a given RDKit Mol as a 1-d Numpy array.

class optunaz.descriptors.UnscaledJazzyDescriptors(name='UnscaledJazzyDescriptors', parameters=UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None))[source]

Bases: MolDescriptor

Base (unscaled) Jazzy descriptors

These Jazzy descriptors are unscaled and should be used with caution. They offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: this descriptor employs a threshold of <50 Hydrogen acceptors/donors and a Mw of <1000Da for compound inputs.

class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None)[source]

Bases: object

jazzy_names = None
jazzy_filters = None
name = 'UnscaledJazzyDescriptors'
parameters = UnscaledJazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None)
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.UnscaledZScalesDescriptors(name='UnscaledZScalesDescriptors', parameters=UnscaledZScalesDescriptors.Parameters())[source]

Bases: MolDescriptor

Unscaled Z-Scales.

Compute the Z-scales of a peptide SMILES. These Z-Scales descriptors are unscaled and should be used with caution.

class Parameters[source]

Bases: object

name = 'UnscaledZScalesDescriptors'
parameters = UnscaledZScalesDescriptors.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.PrecomputedDescriptorFromFile(name, parameters)[source]

Bases: MolDescriptor

Precomputed descriptors.

Users can supply a CSV file of feature vectors to use as descriptors, with headers on the first line. Each row corresponds to a compound in the training set, followed by a column that may have comma-separated vectors describing that molecule.
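A file in this format might be read as in the sketch below. The column names `smiles` and `fp`, the sample values, and the `load_precomputed` helper are assumptions for illustration; the quoting of the vector cell is one plausible way to keep its commas inside a single CSV field.

```python
import csv
import io

# Header on the first line; the vector column is quoted so its commas
# are not treated as field separators.
csv_text = (
    "smiles,fp\n"
    'CCO,"0.1,0.5,1.0"\n'
    'c1ccccc1,"0.0,2.0,3.5"\n'
)


def load_precomputed(handle, input_column, response_column):
    """Map each SMILES to its comma-separated vector parsed as floats."""
    lookup = {}
    for row in csv.DictReader(handle):
        values = row[response_column].split(",")
        lookup[row[input_column]] = [float(v) for v in values]
    return lookup


descriptors = load_precomputed(io.StringIO(csv_text), "smiles", "fp")
```

Looking up a SMILES string in the resulting dict then plays the role that calculate_from_smi plays for computed descriptors.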

class Parameters(file=None, input_column=None, response_column=None)[source]

Bases: object

Parameters:
  • file (str) – Name of the CSV containing precomputed descriptors - title: file

  • input_column (str) – Name of input column with SMILES strings - title: Input Column

  • response_column (str) – Name of response column with the comma-separated vectors that the model will use as pre-computed descriptors - title: Response column

file = None
input_column = None
response_column = None
name
parameters
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

inference_parameters(file, input_column, response_column)[source]

This function allows precomputed descriptors to be used for inference for a new file

class optunaz.descriptors.SmilesFromFile(name, parameters=SmilesFromFile.Parameters())[source]

Bases: MolDescriptor

Smiles as descriptors (for ChemProp).

ChemProp optimisation runs require either this or SmilesAndSideInfoFromFile descriptor to be selected. This setting allows the SMILES to pass through to the ChemProp package.

class Parameters[source]

Bases: object

name
parameters = SmilesFromFile.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.SmilesAndSideInfoFromFile(name, parameters)[source]

Bases: MolDescriptor

SMILES & side information descriptors (for ChemProp).

ChemProp optimisation requires either these or SmilesFromFile descriptors. This descriptor allows SMILES to pass through to ChemProp, _and_ for side information to be supplied as auxiliary tasks.

Side information can take the form of any vector (continuous or binary) which describe input compounds. All tasks are learnt in a multi-task manner to improve main-task (task of intent) predictions. Side information can boost performance since their contribution to network loss can lead to improved learnt molecular representations.

Optimal side information weighting (how much auxiliary tasks contribute to network loss) is also an (optional) learned parameter during optimisation.

Similar to PrecomputedDescriptorFromFile, CSV inputs for this descriptor should contain a SMILES column of input molecules. All vectors in the remaining columns are used as user-derived side information (i.e. take care to upload a CSV containing only side-information tasks in the other columns, since _all_ of them are used).

(see https://ruder.io/multi-task/index.html#auxiliarytasks for details).

class Parameters(file=None, input_column=None, aux_weight_pc=SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20))[source]

Bases: object

Parameters:
  • file (str) – Name of the CSV containing precomputed side-info descriptors - title: file

  • input_column (str) – Name of input column with SMILES strings - title: Input Column

  • aux_weight_pc (Aux_Weight_Pc) – How much (%) auxiliary tasks (side information) contribute to the loss function optimised during training. The larger the number, the larger the weight of side information. - title: Auxiliary weight percentage

class Aux_Weight_Pc(low: int = 100, high: int = 100, q: int = 20)[source]

Bases: object

low = 100
high = 100
q = 20
file = None
input_column = None
aux_weight_pc = SmilesAndSideInfoFromFile.Parameters.Aux_Weight_Pc(low=100, high=100, q=20)
name
parameters
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.ScaledDescriptor(parameters, name='ScaledDescriptor')[source]

Bases: MolDescriptor

Scaled Descriptor.

This descriptor is not a complete descriptor, but instead it wraps and scales another descriptor.

Some algorithms require input to be within certain range, e.g. [-1..1]. Some descriptors have different ranges for different columns/features. This descriptor wraps another descriptor and provides scaled values.
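The wrapping idea can be sketched without any third-party library. Standardisation (zero mean, unit variance per column) is assumed here purely for illustration; the scaler the package actually fits is a scikit-learn object whose type is not stated in this section, and `ToyScaledDescriptor` and `toy_descriptor` are hypothetical.

```python
from statistics import mean, pstdev


class ToyScaledDescriptor:
    """Wrap a descriptor function and scale each feature column."""

    def __init__(self, descriptor, reference_smiles):
        self.descriptor = descriptor
        # Fit per-column statistics on a reference set of molecules.
        columns = list(zip(*(descriptor(s) for s in reference_smiles)))
        self.means = [mean(c) for c in columns]
        self.stds = [pstdev(c) or 1.0 for c in columns]  # guard constant columns

    def calculate_from_smi(self, smi):
        raw = self.descriptor(smi)
        return [(x - m) / s for x, m, s in zip(raw, self.means, self.stds)]


def toy_descriptor(smi):
    # Two features on very different scales.
    return [float(len(smi)), float(smi.count("O")) * 100.0]


reference = ["CCO", "CCCCO", "CC"]
scaled = ToyScaledDescriptor(toy_descriptor, reference)
fitted = [scaled.calculate_from_smi(s) for s in reference]
```

After scaling, both columns are centred on zero over the reference set even though the raw features differ by two orders of magnitude.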

class ScaledDescriptorParameters(descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors], scaler: Union[FittedSklearnScaler, UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'))[source]

Bases: object

descriptor
scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
parameters
name = 'ScaledDescriptor'
set_unfitted_scaler_data(file_path, smiles_column, cache=None)[source]
calculate_from_smi(smi, cache=None)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.PhyschemDescriptors(parameters=PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>), name='PhyschemDescriptors')[source]

Bases: ScaledDescriptor

PhyschemDescriptors (scaled) calculated in RDKit

A set of 208 physchem/molecular properties that are calculated in RDKit and used as descriptor vectors for input molecules. Features include ClogP, MW, # of atoms, rings, rotatable bonds, fraction sp3 C, graph invariants (Kier indices etc), TPSA, Slogp descriptors, counts of some functional groups, VSA MOE-type descriptors, estimates of atomic charges etc. (See https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors).

Vectors whose components are molecular descriptors have been used as high-level feature representations for molecular machine learning. One advantage of molecular descriptor vectors is their interpretability, since the meaning of a physicochemical descriptor can be intuitively understood.

class Parameters(rdkit_names: Optional[List[str]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)[source]

Bases: object

rdkit_names = None
scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
descriptor

alias of UnscaledPhyschemDescriptors

parameters = PhyschemDescriptors.Parameters(rdkit_names=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledPhyschemDescriptors'>)
name = 'PhyschemDescriptors'
class optunaz.descriptors.JazzyDescriptors(parameters=JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>), name='JazzyDescriptors')[source]

Bases: ScaledDescriptor

Scaled Jazzy descriptors

Jazzy descriptors offer a molecular vector describing the hydration free energies and hydrogen-bond acceptor and donor strengths. A publication describing the implementation, fitting, and validation of Jazzy can be found at doi.org/10.1038/s41598-023-30089-x. These descriptors use the “MMFF94” minimisation method. NB: Jazzy employs a threshold of <50 Hydrogen acceptors/donors and Mw of <1000Da for input compounds.

class Parameters(jazzy_names: Optional[List[str]] = None, jazzy_filters: Optional[Dict[str, Any]] = None, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)[source]

Bases: object

jazzy_names = None
jazzy_filters = None
scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
descriptor

alias of UnscaledJazzyDescriptors

name = 'JazzyDescriptors'
parameters = JazzyDescriptors.Parameters(jazzy_names=None, jazzy_filters=None, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledJazzyDescriptors'>)
class optunaz.descriptors.MAPC(parameters=MAPC.Parameters(maxRadius=2, nPermutations=2048, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledMAPC'>), name='MAPC')[source]

Bases: ScaledDescriptor

Scaled MAPC descriptors

MAPC (MinHashed Atom-Pair Fingerprint Chiral) (see Orsi et al. One chiral fingerprint to find them all) is the original version of the MinHashed Atom-Pair fingerprint of radius 2 (MAP4) which combined circular substructure fingerprints and atom-pair fingerprints into a unified framework. This combination allowed for improved substructure perception and performance in small molecule benchmarks while retaining information about bond distances for molecular size and shape perception.

These fingerprints expand the functionality of MAP4 to include encoding of stereochemistry into the fingerprint. CIP descriptors of chiral atoms are encoded into the fingerprint at the highest radius. This allows MAPC to modulate the impact of stereochemistry on fingerprints, making it scale with increasing molecular size without disproportionally affecting structural fingerprints/similarity.

class Parameters(maxRadius: int = 2, nPermutations: int = 2048, scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledMAPC'>)[source]

Bases: object

maxRadius = 2
nPermutations = 2048
scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
descriptor

alias of UnscaledMAPC

name = 'MAPC'
parameters = MAPC.Parameters(maxRadius=2, nPermutations=2048, scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledMAPC'>)
class optunaz.descriptors.ZScalesDescriptors(parameters=ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>), name='ZScalesDescriptors')[source]

Bases: ScaledDescriptor

Scaled Z-Scales descriptors.

Z-scales were proposed in Sandberg et al. (1998) based on physicochemical properties of proteinogenic and non-proteinogenic amino acids, including NMR data and thin-layer chromatography (TLC) data. Refer to doi:10.1021/jm9700575 for the original publication. The five descriptors capture: 1. lipophilicity; 2. steric properties (steric bulk and polarizability); 3. electronic properties (polarity and charge); 4. electronegativity (heat of formation, electrophilicity and hardness); and 5. a second electronegativity-related scale. This fingerprint is the average of the Z-scales of all the amino acids in the peptide.
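The averaging step can be sketched as follows. The lookup table holds toy placeholder numbers, not the published Z-scale values, and average_zscales is a hypothetical helper rather than part of optunaz.

```python
def average_zscales(peptide, zscale_table):
    """Average per-residue Z-scale vectors over a peptide sequence."""
    vectors = [zscale_table[residue] for residue in peptide]
    n_components = len(vectors[0])
    return [
        sum(vec[i] for vec in vectors) / len(vectors)
        for i in range(n_components)
    ]


# Toy placeholder values, NOT the published Z-scales from Sandberg et al.
TOY_ZSCALES = {
    "A": [1.0, 0.0, 2.0, 0.0, 1.0],
    "G": [3.0, 2.0, 0.0, 0.0, 1.0],
}

# One five-component vector per peptide, regardless of peptide length.
descriptor = average_zscales("AGA", TOY_ZSCALES)
```

Because the result is a mean, peptides of different lengths map to vectors of the same size, but residue order is not captured.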

class Parameters(scaler: Union[optunaz.descriptors.FittedSklearnScaler, optunaz.descriptors.UnfittedSklearnScaler] = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor: Union[optunaz.descriptors.Avalon, optunaz.descriptors.ECFP, optunaz.descriptors.ECFP_counts, optunaz.descriptors.PathFP, optunaz.descriptors.AmorProtDescriptors, optunaz.descriptors.MACCS_keys, optunaz.descriptors.PrecomputedDescriptorFromFile, optunaz.descriptors.UnscaledMAPC, optunaz.descriptors.UnscaledPhyschemDescriptors, optunaz.descriptors.UnscaledJazzyDescriptors, optunaz.descriptors.UnscaledZScalesDescriptors] = <class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)[source]

Bases: object

scaler = UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler')
descriptor

alias of UnscaledZScalesDescriptors

name = 'ZScalesDescriptors'
parameters = ZScalesDescriptors.Parameters(scaler=UnfittedSklearnScaler(mol_data=UnfittedSklearnScaler.MolData(file_path=None, smiles_column=None), name='UnfittedSklearnScaler'), descriptor=<class 'optunaz.descriptors.UnscaledZScalesDescriptors'>)
class optunaz.descriptors.CompositeDescriptor(parameters, name='CompositeDescriptor')[source]

Bases: MolDescriptor

Composite descriptor

Concatenates multiple descriptors into one. Select multiple algorithms from the button below. Please note the ChemProp SMILES descriptors are not compatible with this function.
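A minimal sketch of the concatenation idea, using plain Python lists in place of NumPy arrays. The helper name, the toy descriptor functions and the choice to return None when any sub-descriptor fails are assumptions for illustration, not the optunaz implementation.

```python
def concatenate_descriptors(smi, descriptor_fns):
    """Compute each descriptor for one SMILES and join the results end to end."""
    combined = []
    for fn in descriptor_fns:
        values = fn(smi)
        if values is None:
            # Assumption: one failed sub-descriptor invalidates the whole vector.
            return None
        combined.extend(values)
    return combined


# Hypothetical toy descriptors; real ones would be fingerprints or physchem values.
def element_counts(smi):
    return [smi.count("C"), smi.count("N"), smi.count("O")]


def length_feature(smi):
    return [len(smi)]


def always_fails(smi):
    return None


vector = concatenate_descriptors("CCO", [element_counts, length_feature])
failed = concatenate_descriptors("CCO", [element_counts, always_fails])
```

The length of the composite vector is the sum of the lengths of its parts, so the order of the descriptor list fixes the feature layout.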

class Parameters(descriptors: List[Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors]])[source]

Bases: object

descriptors
parameters
name = 'CompositeDescriptor'
calculate_from_smi(smi, cache=None)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

fp_info()[source]
class optunaz.descriptors.CanonicalSmiles(name, parameters=CanonicalSmiles.Parameters())[source]

Bases: MolDescriptor

Canonical Smiles for use in utility functions (not for user selection).

class Parameters[source]

Bases: object

name
parameters = CanonicalSmiles.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.Scaffold(name, parameters=Scaffold.Parameters())[source]

Bases: MolDescriptor

Scaffold Smiles for use in utility functions (not for user selection).

class Parameters[source]

Bases: object

name
parameters = Scaffold.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.GenericScaffold(name, parameters=GenericScaffold.Parameters())[source]

Bases: MolDescriptor

Generic Scaffold Smiles for use in utility functions (not for user selection).

class Parameters[source]

Bases: object

name
parameters = GenericScaffold.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (e.g. a fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

class optunaz.descriptors.ValidDescriptor(name, parameters=ValidDescriptor.Parameters())[source]

Bases: MolDescriptor

Validates Smiles for use in utility functions (not for user selection).

class Parameters[source]

Bases: object

name
parameters = ValidDescriptor.Parameters()
calculate_from_smi(smi)[source]

Returns a descriptor (fingerprint) for a given SMILES string.

The descriptor is returned as a 1-d Numpy ndarray.

Returns None if input SMILES string is not valid according to RDKit.

optunaz.descriptors.descriptor_from_config(smiles, descriptor, cache=None, return_failed_idx=True)[source]

Returns molecular descriptors (fingerprints) for a given set of SMILES and configuration.

When return_failed_idx is True, this returns a 2d numpy array and the valid indices for that descriptor. When return_failed_idx is False, this returns the raw descriptor output (e.g. for canonical SMILES).
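The pairing of a descriptor matrix with valid indices can be sketched as below. The helper and the toy descriptor are hypothetical, and nested lists stand in for the 2d numpy array; this is only meant to show why the indices are needed to map rows back to the input SMILES.

```python
def descriptors_with_valid_idx(smiles, calculate):
    """Return descriptor rows plus the input positions that produced them."""
    rows, valid_idx = [], []
    for i, smi in enumerate(smiles):
        desc = calculate(smi)
        if desc is not None:
            rows.append(list(desc))
            valid_idx.append(i)
    return rows, valid_idx


def toy_descriptor(smi):
    # Hypothetical validity rule: treat any string containing "?" as invalid.
    if "?" in smi:
        return None
    return [len(smi), smi.count("C")]


rows, valid_idx = descriptors_with_valid_idx(["CCO", "C?C", "CN"], toy_descriptor)
```

Here the second input is dropped, so row 1 of the matrix corresponds to input position 2.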

optunaz.evaluate module

optunaz.evaluate.score_all(scores, estimator, X, y)[source]
optunaz.evaluate.get_scores(mode)[source]
optunaz.evaluate.score_all_smiles(scores, estimator, smiles, descriptor, aux, y, cache=None)[source]
optunaz.evaluate.get_train_test_scores(estimator, buildconfig, cache=None)[source]
optunaz.evaluate.get_merged_train_score(estimator, buildconfig, cache=None)[source]

optunaz.explainability module

optunaz.explainability.get_ecfp_fpinfo(m, descriptor)[source]

Return the ecfp info for a compound mol

optunaz.explainability.get_ecfpcount_fpinfo(m, descriptor)[source]

Return the ecfp_count info for a compound mol

optunaz.explainability.explain_ECFP(len_feats, estimator, descriptor)[source]

Explain ECFPs using train atom environments

optunaz.explainability.get_fp_info(exp_df, estimator, descript, fp_idx, strt_idx=None)[source]

Get ECFP SMILES environments or Physchem names when available

optunaz.explainability.runShap(estimator, X_pred, mode)[source]

Explain model prediction using auto explainer or SHAP KernelExplainer

optunaz.explainability.ShapExplainer(estimator, X_pred, mode, descriptor)[source]

Run SHAP and populate the explainability dataframe

optunaz.explainability.ExplainPreds(estimator, X_pred, mode, descriptor)[source]

Explain predictions using either SHAP (shallow models) or ChemProp interpret

optunaz.metircs module

optunaz.metircs.validate_cls_input(y_true, y_pred)[source]

Validate true and predicted arrays for metrics.

optunaz.metircs.auc_pr_cal(y_true, y_pred, pi_zero=0.1)[source]

Compute calibrated AUC PR metric.

Implemented according to MELLODDY SparseChem https://github.com/melloddy/SparseChem. Calibration modifies the AUC PR to account for class imbalance.

optunaz.metircs.bedroc_score(y_true, y_pred, alpha=20.0)[source]

Compute BEDROC metric.

Implemented according to Truchon, J. & Bayly, C.I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. J. Chem. Inf. Model. 47, 488-508 (2007).
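For orientation, BEDROC can be written as the robust initial enhancement (RIE) rescaled between its minimum and maximum attainable values. The function below is a standard-library sketch of that formulation and is not the optunaz code; the name bedroc_sketch and the decision to raise on single-class input are assumptions.

```python
import math


def bedroc_sketch(y_true, y_pred, alpha=20.0):
    """BEDROC computed as the RIE rescaled between its minimum and maximum."""
    # Rank compounds from highest to lowest predicted score.
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    n_total = len(order)
    active_ranks = [rank for rank, i in enumerate(order, start=1) if y_true[i]]
    n_actives = len(active_ranks)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive compound")
    ratio = n_actives / n_total
    # RIE: exponentially weighted sum over active ranks, divided by its
    # expected value when the actives are ranked uniformly at random.
    weighted = sum(math.exp(-alpha * rank / n_total) for rank in active_ranks)
    expected = ratio * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = weighted / expected
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
best = bedroc_sketch(labels, [10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
worst = bedroc_sketch(labels, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Larger alpha values concentrate the weight on the top of the ranked list, which is what makes the metric sensitive to early recognition.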

optunaz.metircs.concordance_index(y_true, y_pred)[source]

Compute Concordance index.

Statistical metric to indicate the quality of a predicted ranking, based on Steck, H., et al. “On ranking in survival analysis: Bounds on the concordance index.” Advances in neural information processing systems (2008): 1209-1216.
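A pairwise sketch of the concordance index: the fraction of comparable pairs whose predicted order matches their true order. This is an illustrative quadratic-time version, not the optunaz implementation; counting tied predictions as half-concordant and skipping pairs with tied true values are assumptions.

```python
def concordance_index_sketch(y_true, y_pred):
    """Fraction of comparable pairs that the predictions order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied true values carry no ordering information
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5  # assumption: tied predictions score half
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


perfect = concordance_index_sketch([1.0, 2.0, 3.0], [0.1, 0.5, 0.9])
reversed_order = concordance_index_sketch([1.0, 2.0, 3.0], [0.9, 0.5, 0.1])
uninformative = concordance_index_sketch([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A value of 1 means every comparable pair is ordered correctly, 0.5 corresponds to an uninformative ranking, and 0 to a fully reversed one.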

optunaz.model_writer module

class optunaz.model_writer.Predictor[source]

Bases: ABC

Interface definition for scikit-learn/chemprop Predictor.

Scikit-learn does not define a class that describes the Predictor interface. Instead, scikit-learn describes in text that Predictor should have method ‘predict’, and optionally ‘predict_proba’: https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects

This class describes this interface as an abstract Python class, for convenience and better type checking.

abstract predict(data)[source]

Returns predicted values.

predict_proba(data)[source]

For Classification algorithms, returns algorithmic posterior of a prediction.

This method is optional, and is not marked with @abstractmethod.

predict_uncert(data)[source]

For supported algorithms, quantifies uncertainty of a prediction.

This method is optional, and is not marked with @abstractmethod.

explain(data)[source]

Explains a prediction.

This method is optional, and is not marked with @abstractmethod.
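The pattern described for Predictor, one required method plus optional ones, can be sketched with the standard abc module. PredictorSketch and MeanPredictor are hypothetical names, and raising NotImplementedError from the optional method is an assumption about how an unsupported call might be signalled.

```python
from abc import ABC, abstractmethod


class PredictorSketch(ABC):
    """Hypothetical stand-in for an interface with one required method."""

    @abstractmethod
    def predict(self, data):
        """Return predicted values; subclasses must implement this."""

    def predict_proba(self, data):
        # Optional: not abstract, so subclasses may leave it out.
        raise NotImplementedError("only classifiers provide probabilities")


class MeanPredictor(PredictorSketch):
    """Toy regressor that predicts the same value for every input row."""

    def __init__(self, mean):
        self.mean = mean

    def predict(self, data):
        return [self.mean for _ in data]


model = MeanPredictor(0.5)
predictions = model.predict([[1, 2], [3, 4]])
```

Instantiating the abstract class directly raises TypeError, which is what gives the interface its type-checking value.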

class optunaz.model_writer.QSARtunaModel(predictor: Predictor, descriptor: Union[Avalon, ECFP, ECFP_counts, PathFP, AmorProtDescriptors, MACCS_keys, PrecomputedDescriptorFromFile, UnscaledMAPC, UnscaledPhyschemDescriptors, UnscaledJazzyDescriptors, UnscaledZScalesDescriptors, ScaledDescriptor, MAPC, PhyschemDescriptors, JazzyDescriptors, ZScalesDescriptors, CompositeDescriptor, SmilesFromFile, SmilesAndSideInfoFromFile], mode: ModelMode, transform: Optional[ModelDataTransform] = None, aux_transform: Union[VectorFromColumn, ZScales, AmorProt, NoneType] = None, metadata: Optional[Dict] = None)[source]

Bases: ABC

predictor
descriptor
mode
transform = None
aux_transform = None
metadata = None
predict_from_smiles(smiles, aux=None, uncert=False, explain=False, transform='default', aux_transform=None)[source]

Returns model predictions for the input SMILES strings.

If an input SMILES is invalid for the descriptor, the descriptor returns None for it. Those None values are not sent to the model; instead, NaN is used as the predicted value for each invalid SMILES.
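The NaN handling can be sketched as follows, assuming descriptors arrive as a list in which invalid entries are None. predict_with_nan and the toy model are hypothetical and only illustrate how predictions are scattered back to their original positions.

```python
import math


def predict_with_nan(descriptors, predict):
    """Predict only for valid descriptors and fill the rest with NaN."""
    valid_idx = [i for i, desc in enumerate(descriptors) if desc is not None]
    valid_rows = [descriptors[i] for i in valid_idx]
    predictions = [math.nan] * len(descriptors)
    if valid_rows:
        # Scatter the model output back to the original input positions.
        for i, value in zip(valid_idx, predict(valid_rows)):
            predictions[i] = value
    return predictions


def toy_model(rows):
    # Hypothetical model: the prediction is the sum of each descriptor row.
    return [sum(row) for row in rows]


result = predict_with_nan([[1, 2], None, [3, 4]], toy_model)
```

The output keeps the same length and order as the input, so callers can align predictions with their SMILES without extra bookkeeping.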

optunaz.model_writer.get_metadata(buildconfig, train_scores, test_scores)[source]

Metadata for a predictive model.

optunaz.model_writer.get_transform(data)[source]
optunaz.model_writer.perform_ptr(metadata, transform, predictions)[source]
optunaz.model_writer.wrap_model(model, descriptor, mode, transform=None, aux_transform=None, metadata=None)[source]
optunaz.model_writer.save_model(model, buildconfig, filename, train_scores, test_scores)[source]

optunaz.objective module

exception optunaz.objective.NoValidDescriptors[source]

Bases: Exception

Raised when none of the supplied descriptors are compatible with any of the supplied algorithms

optunaz.objective.null_scores(scoring)[source]
class optunaz.objective.Objective(optconfig: OptimizationConfig, train_smiles: List[str], train_y: numpy.ndarray, train_aux: numpy.ndarray = None, cache: Optional[joblib.memory.Memory] = None)[source]

Bases: object

optconfig
train_smiles
train_y
train_aux = None
cache = None

optunaz.optbuild module

optunaz.optbuild.main()[source]

optunaz.predict module

exception optunaz.predict.ArgsError[source]

Bases: Exception

Thrown when there is an issue with basic args at inference time

exception optunaz.predict.UncertaintyError[source]

Bases: Exception

Thrown when uncertainty parameters are not set correctly at inference

exception optunaz.predict.AuxCovariateMissing[source]

Bases: Exception

Thrown when a model is trained using Auxiliary (covariate) data which is not supplied at inference

exception optunaz.predict.PrecomputedError[source]

Bases: Exception

Raised when a model was trained with a precomputed descriptor that is not supplied at runtime, or when a required argument is missing

optunaz.predict.validate_args(args)[source]
optunaz.predict.validate_uncertainty(args, model)[source]
optunaz.predict.check_precomp_args(args)[source]
optunaz.predict.set_inference_params(args, desc)[source]
optunaz.predict.validate_set_precomputed(args, model)[source]
optunaz.predict.validate_aux(args, model)[source]
optunaz.predict.main()[source]

optunaz.schemagen module

optunaz.schemagen.doctitle(doc)[source]

Returns the first line of the docstring as the title.
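One way such a helper could behave is sketched below using inspect.cleandoc from the standard library. doctitle_sketch is a hypothetical name, and returning None for an empty docstring is an assumption rather than documented behaviour.

```python
import inspect


def doctitle_sketch(doc):
    """Return the first non-empty docstring line, or None if there is none."""
    if not doc:
        return None
    # cleandoc strips the common indentation and leading blank lines.
    lines = inspect.cleandoc(doc).splitlines()
    return lines[0].strip() if lines else None


example_doc = """
    Scaled MAPC descriptors

    Longer description that would become the schema description.
    """
title = doctitle_sketch(example_doc)
```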

optunaz.schemagen.type_base_schema(tp)[source]

Adds title and description from docstrings.

See https://wyfo.github.io/apischema/0.16/json_schema/#base-schema

optunaz.schemagen.patch_schema_generic(schema)[source]
optunaz.schemagen.patch_schema_optunaz(schema)[source]
optunaz.schemagen.main()[source]

optunaz.three_step_opt_build_merge module

optunaz.three_step_opt_build_merge.split_optimize(optconfig)[source]

Split Hyperparameter runs into non-chemprop and chemprop runs for Optuna.

optunaz.three_step_opt_build_merge.base_chemprop_params(alg)[source]

Used to enqueue an initial ChemProp run that captures sensible defaults as defined by the original authors. A check is performed to ensure that any parameters outside the valid Optuna subspace are removed from the fixed parameters.

optunaz.three_step_opt_build_merge.run_study(optconfig, study_name, objective, n_startup_trials, n_trials, seed, storage=True, trial_number_offset=0)[source]

Run an Optuna study

optunaz.three_step_opt_build_merge.optimize(optconfig, study_name=None)[source]

Step 1. Hyperparameter optimization using Optuna.

optunaz.three_step_opt_build_merge.buildconfig_best(study)[source]
optunaz.three_step_opt_build_merge.log_scores(scores, main_score, label)[source]
optunaz.three_step_opt_build_merge.build_best(buildconfig, outfname, cache=None)[source]

Step 2. Build. Train a model with the best hyperparameters.

optunaz.three_step_opt_build_merge.build_merged(buildconfig, outfname, cache=None)[source]

Step 3. Merge datasets and re-train the model.

optunaz.visualizer module

class optunaz.visualizer.Visualizer[source]

Bases: object

Class to visualize various aspects of the optimization / building process.

plot_by_configuration(conf, study)[source]
plot_slice(folder_path, study, file_format='png')[source]
plot_parallel_coordinate(folder_path, study, file_format='png')[source]
plot_contour(folder_path, study, file_format='png')[source]
static plot_history(file_path, study)[source]

Module contents