optunaz.utils.preprocessing package

Submodules

optunaz.utils.preprocessing.deduplicator module

class optunaz.utils.preprocessing.deduplicator.Deduplicator[source]

Bases: object

Base class for deduplicators.

Each deduplicator should provide a dedup method, which takes a dataframe and the name of the SMILES column, and returns a dataframe with duplicates removed.

abstract dedup(df, smiles_col)[source]
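
The dedup contract described above can be illustrated with a minimal sketch. The class names DeduplicatorSketch and KeepFirstSketch are hypothetical stand-ins, not the library's classes, and the use of pandas drop_duplicates is an assumption about one reasonable way to satisfy the interface.

```python
from abc import ABC, abstractmethod

import pandas as pd


class DeduplicatorSketch(ABC):
    """Hypothetical stand-in for the Deduplicator base class."""

    @abstractmethod
    def dedup(self, df: pd.DataFrame, smiles_col: str) -> pd.DataFrame:
        """Return df with rows that share a SMILES string removed."""


class KeepFirstSketch(DeduplicatorSketch):
    """Keeps the first row seen for each SMILES string."""

    def dedup(self, df: pd.DataFrame, smiles_col: str) -> pd.DataFrame:
        # Raw strings are compared here; no SMILES canonicalisation is done.
        return df.drop_duplicates(subset=smiles_col, keep="first")


data = pd.DataFrame(
    {"smiles": ["CCO", "CCO", "c1ccccc1"], "activity": [1.0, 3.0, 5.0]}
)
deduped = KeepFirstSketch().dedup(data, "smiles")
```

In pandas, passing keep="last" to drop_duplicates retains the last occurrence instead of the first.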
class optunaz.utils.preprocessing.deduplicator.KeepFirst(name='KeepFirst')[source]

Bases: Deduplicator

Keep first.

name = 'KeepFirst'
dedup(df, smiles_col)[source]
class optunaz.utils.preprocessing.deduplicator.KeepLast(name='KeepLast')[source]

Bases: Deduplicator

Keep last.

name = 'KeepLast'
dedup(df, smiles_col)[source]
class optunaz.utils.preprocessing.deduplicator.KeepRandom(name='KeepRandom', seed=None)[source]

Bases: Deduplicator

Keep random.

name = 'KeepRandom'
seed = None
dedup(df, smiles_col)[source]
class optunaz.utils.preprocessing.deduplicator.KeepMin(name='KeepMin')[source]

Bases: Deduplicator

Keep min.

name = 'KeepMin'
dedup(df, smiles_col)[source]
class optunaz.utils.preprocessing.deduplicator.KeepMax(name='KeepMax')[source]

Bases: Deduplicator

Keep max.

name = 'KeepMax'
dedup(df, smiles_col)[source]
class optunaz.utils.preprocessing.deduplicator.KeepAvg(name='KeepAvg')[source]

Bases: Deduplicator

Keep average. Classification will threshold at 0.5.

This deduplicator converts input SMILES to canonical SMILES.

name = 'KeepAvg'
dedup(df, smiles_col)[source]

For regression, keep mean value.
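
A minimal sketch of the averaging behaviour, assuming the SMILES column is already canonical (the canonicalisation step needs a cheminformatics toolkit and is omitted). The function name keep_avg_sketch, the value_col argument and the handling of a mean of exactly 0.5 are assumptions made for illustration only.

```python
import pandas as pd


def keep_avg_sketch(df, smiles_col, value_col, classification=False):
    """Average the response of rows that share a SMILES string."""
    # sort=False keeps groups in order of first appearance.
    averaged = df.groupby(smiles_col, as_index=False, sort=False)[value_col].mean()
    if classification:
        # Assumption: a mean label strictly above 0.5 becomes class 1.
        averaged[value_col] = (averaged[value_col] > 0.5).astype(int)
    return averaged


data = pd.DataFrame(
    {"smiles": ["CCO", "CCO", "c1ccccc1"], "activity": [1.0, 3.0, 5.0]}
)
regression = keep_avg_sketch(data, "smiles", "activity")

labels = pd.DataFrame(
    {"smiles": ["CCO", "CCO", "CCO", "c1ccccc1"], "label": [1, 1, 0, 0]}
)
classification = keep_avg_sketch(labels, "smiles", "label", classification=True)
```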

class optunaz.utils.preprocessing.deduplicator.KeepMedian(name='KeepMedian')[source]

Bases: Deduplicator

Keep median. Classification will threshold at 0.5.

This deduplicator converts input SMILES to canonical SMILES.

name = 'KeepMedian'
dedup(df, smiles_col)[source]

For regression, keep median value.

class optunaz.utils.preprocessing.deduplicator.KeepAllNoDeduplication(name='KeepAllNoDeduplication')[source]

Bases: Deduplicator

Keep all.

Do not perform any deduplication.

name = 'KeepAllNoDeduplication'
dedup(df, smiles_col)[source]

optunaz.utils.preprocessing.splitter module

class optunaz.utils.preprocessing.splitter.SklearnSplitter[source]

Bases: ABC

Interface definition for scikit-learn cross-validation splitter.

Scikit-learn does not define a class that describes the splitter interface. Instead, scikit-learn describes in text that a splitter should have two methods: ‘get_n_splits’ and ‘split’.

This class describes this splitter interface as an abstract Python class, for convenience and better type checking.

abstract get_n_splits(X, y, groups)[source]
abstract split(X, y, groups)[source]
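
The two-method interface can be sketched as an abstract class plus a toy implementation. SplitterInterfaceSketch and LeaveLastOut are hypothetical names introduced for this example; they are not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

import numpy as np


class SplitterInterfaceSketch(ABC):
    """Hypothetical restatement of the scikit-learn splitter interface."""

    @abstractmethod
    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        """Return the number of (train, test) pairs that split will yield."""

    @abstractmethod
    def split(self, X, y=None, groups=None) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """Yield (train_indices, test_indices) pairs."""


class LeaveLastOut(SplitterInterfaceSketch):
    """Toy splitter: a single split that holds out the last sample."""

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        return 1

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices[:-1], indices[-1:]


splits = list(LeaveLastOut().split([10, 20, 30, 40]))
```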
class optunaz.utils.preprocessing.splitter.Splitter[source]

Bases: object

Splitter for input data.

This is the base class for classes that split input data into train and test.

See also CvSplitter for making multiple cross-validation splits.

Splitter and CvSplitter are used to define valid input choices for splitting data into train-test sets, and for splitting train data into cross-validation splits in scikit-learn cross_validate function. These two sets of options might be different (although underlying implementations might be merged).

split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

abstract get_sklearn_splitter(n_splits)[source]
class optunaz.utils.preprocessing.splitter.Random(name='Random', fraction=0.2, seed=1, leave_out=0.0)[source]

Bases: Splitter

Random split.

Parameters:
  • name (Literal) –

  • fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples

  • seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator

  • leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction

name = 'Random'
fraction = 0.2
seed = 1
leave_out = 0.0
get_sklearn_splitter(n_splits)[source]
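
How fraction, seed and leave_out could interact is sketched below. The function random_split_sketch is hypothetical, and the order of operations (discard the left-out samples first, then take the test fraction from what remains) is an assumption rather than the documented behaviour.

```python
import numpy as np


def random_split_sketch(n_samples, fraction=0.2, seed=1, leave_out=0.0):
    """Return sorted (train, test) index arrays for one random split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Assumption: left-out samples are discarded before the train/test split.
    n_left_out = int(round(leave_out * n_samples))
    kept = shuffled[n_left_out:]
    n_test = int(round(fraction * len(kept)))
    return np.sort(kept[n_test:]), np.sort(kept[:n_test])


train_idx, test_idx = random_split_sketch(100, fraction=0.2, seed=1, leave_out=0.5)
```

With these arguments, half of the 100 samples are discarded, and the remaining 50 are divided into 40 train and 10 test indices. Reusing the same seed reproduces the same split.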
class optunaz.utils.preprocessing.splitter.Temporal(name='Temporal', fraction=0.2)[source]

Bases: Splitter

Temporal split.

Assumes that the data is sorted, with the oldest entries at the beginning of the file, and the newest entries added at the end.

name = 'Temporal'
fraction = 0.2
split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

get_sklearn_splitter(n_splits)[source]
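
A minimal sketch of a temporal split over row indices. It assumes that the newest rows (the tail of the file) form the test set; temporal_split_sketch is a hypothetical helper, not the library's implementation.

```python
import numpy as np


def temporal_split_sketch(n_samples, fraction=0.2):
    """Return (train, test) indices with the last rows held out."""
    # Assumption: the newest `fraction` of rows becomes the test set.
    n_test = int(round(fraction * n_samples))
    indices = np.arange(n_samples)
    cut = n_samples - n_test
    return indices[:cut], indices[cut:]


train_idx, test_idx = temporal_split_sketch(10, fraction=0.2)
```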
class optunaz.utils.preprocessing.splitter.Stratified(name='Stratified', fraction=0.2, seed=1, leave_out=0.0, bins='fd_merge')[source]

Bases: Splitter

Real-valued Stratified Shuffle Split.

Parameters:
  • name (Literal) –

  • fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples

  • seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator

  • leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction

  • bins (str) – Algorithm to use for determining histogram bin edges; see numpy.histogram for possible options, or use the default ‘fd_merge’ - title: Binning algorithm

This is similar to scikit-learn StratifiedShuffleSplit, but uses histogram binning for real-valued inputs.

If inputs are integers (or strings), this splitter reverts to StratifiedShuffleSplit.

name = 'Stratified'
fraction = 0.2
seed = 1
leave_out = 0.0
bins = 'fd_merge'
get_sklearn_splitter(n_splits)[source]
class optunaz.utils.preprocessing.splitter.NoSplitting(name='NoSplitting')[source]

Bases: Splitter

No splitting.

Do not perform any splitting. Returns all input data as training set, and returns an empty test set.

name = 'NoSplitting'
split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

get_sklearn_splitter(n_splits)[source]
class optunaz.utils.preprocessing.splitter.KFold(name='KFold', shuffle=True, random_state=None)[source]

Bases: Splitter

KFold.

Parameters:
  • name (Literal) –

  • shuffle (bool) – Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled. - title: Shuffle

  • random_state (Optional) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. - title: Random state

Split dataset into k consecutive folds. Note that shuffle defaults to True in this class, so the data is shuffled before splitting unless shuffle is set to False.

Each fold is then used once as a validation, while the k - 1 remaining folds form the training set.
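
The fold rotation can be sketched with numpy alone. The generator kfold_indices_sketch is a hypothetical illustration of the k-fold idea, not the scikit-learn or optunaz implementation.

```python
import numpy as np


def kfold_indices_sketch(n_samples, n_splits, shuffle=True, random_state=None):
    """Yield (train, validation) index arrays, one pair per fold."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    indices = np.arange(n_samples)
    if shuffle:
        indices = np.random.default_rng(random_state).permutation(n_samples)
    folds = np.array_split(indices, n_splits)
    for i, validation in enumerate(folds):
        # The k - 1 remaining folds form the training set.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, validation


folds = list(kfold_indices_sketch(10, n_splits=5, shuffle=False))
```

Every index appears in exactly one validation fold, so the validation sets together cover the whole dataset.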

name = 'KFold'
shuffle = True
random_state = None
split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

get_sklearn_splitter(n_splits)[source]
optunaz.utils.preprocessing.splitter.fd_bin(y)[source]

Histogram binning with empty-bin merging, based on: https://github.com/numpy/numpy/issues/11879 and https://github.com/numpy/numpy/issues/10297

The modification avoids the problem described in those issues by merging adjacent empty bins.
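
One way to merge empty bins is sketched below. This is an assumption about the general idea, not a copy of fd_bin: the helper merge_empty_bins_sketch is hypothetical and simply drops the right edge of every empty bin so that it is absorbed by the next bin.

```python
import numpy as np


def merge_empty_bins_sketch(y, bins="fd"):
    """Return histogram edges with empty bins merged into their neighbours."""
    counts, edges = np.histogram(y, bins=bins)
    # Keep the left-most edge, then the right edge of every non-empty bin.
    # Dropping the right edge of an empty bin merges it into the next bin.
    keep = np.concatenate(([True], counts > 0))
    return edges[keep]


y = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 10.0])
merged_edges = merge_empty_bins_sketch(y)
merged_counts, _ = np.histogram(y, bins=merged_edges)
```

The outlier at 10.0 makes the plain ‘fd’ estimator produce many empty bins between the cluster near zero and the outlier; after merging, every remaining bin holds at least one value.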

optunaz.utils.preprocessing.splitter.stratify(y, bins='fd')[source]

Stratifies (splits into groups) the values in ‘y’.

If input ‘y’ is real-valued (numpy.dtype.kind == ‘f’), this function bins the values based on computed histogram edges.

For all other types of inputs, this function returns the original array, since downstream algorithms can natively deal with integer and categorical data.
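
The branching on dtype can be sketched as follows. stratify_sketch is a hypothetical helper that uses numpy.histogram_bin_edges and numpy.searchsorted; the real function may assign bin labels differently.

```python
import numpy as np


def stratify_sketch(y, bins="fd"):
    """Bin real-valued y into integer strata; pass other dtypes through."""
    y = np.asarray(y)
    if y.dtype.kind != "f":
        # Integer and categorical data are already usable as strata.
        return y
    edges = np.histogram_bin_edges(y, bins=bins)
    # Only interior edges are used, so the maximum falls in the last bin.
    return np.searchsorted(edges[1:-1], y, side="right")


real_valued = np.array([0.1, 0.2, 0.3, 5.0, 5.1, 5.2])
strata = stratify_sketch(real_valued, bins=2)
unchanged = stratify_sketch(np.array([0, 1, 1]))
```

An integer bin count is used in the example to keep the result easy to follow; a string such as ‘fd’ selects one of the numpy bin-width estimators instead.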

class optunaz.utils.preprocessing.splitter.HistogramStratifiedShuffleSplit(test_fraction=0.1, n_splits=10, bins='fd_merge', random_state=42, train_size=0.0)[source]

Bases: SklearnSplitter

StratifiedShuffleSplit for real-valued inputs.

test_fraction = 0.1
n_splits = 10
bins = 'fd_merge'
random_state = 42
train_size = 0.0
get_n_splits(X=None, y=None, groups=None)[source]
split(X, y, groups=None)[source]
class optunaz.utils.preprocessing.splitter.GroupingSplitter[source]

Bases: Splitter, ABC

Splitter for methods that use the groups method.

This is the base class for the Predefined and ScaffoldSplit classes.

abstract groups(df, smiles_col)[source]
class optunaz.utils.preprocessing.splitter.Predefined(column_name=None, name='Predefined')[source]

Bases: GroupingSplitter

Predefined split.

Parameters:
  • column_name (str) – Name of the column with labels for splits. Use -1 to denote datapoints for the train set. - title: Column Name

  • name (Literal) –

Splits data based on predefined labels in a column. Integers can be used, and -1 flags datapoints for use only in the training set. Data points with missing (NaN) values will be removed from train or test.

column_name = None
name = 'Predefined'
get_sklearn_splitter(n_splits)[source]
split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

groups(df, smiles_col)[source]
optunaz.utils.preprocessing.splitter.butina_cluster(groups, cutoff=0.4)[source]

Clusters the scaffolds using Butina clustering and returns the scaffold grouping labels.
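
The clustering step can be sketched in plain Python on a precomputed distance matrix. The function butina_sketch is hypothetical; computing fingerprint distances between scaffolds (which the real function would need a cheminformatics toolkit for) is omitted, and the tie-breaking rule is an assumption.

```python
def butina_sketch(dist, cutoff=0.4):
    """Assign a cluster label to each item of a symmetric distance matrix."""
    n = len(dist)
    # Items closer than the cutoff are treated as neighbours.
    neighbours = [
        {j for j in range(n) if j != i and dist[i][j] <= cutoff}
        for i in range(n)
    ]
    # Visit candidates from most to fewest neighbours (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    labels = [-1] * n
    next_label = 0
    for centroid in order:
        if labels[centroid] != -1:
            continue  # already absorbed by an earlier cluster
        labels[centroid] = next_label
        for j in neighbours[centroid]:
            if labels[j] == -1:
                labels[j] = next_label
        next_label += 1
    return labels


distances = [
    [0.0, 0.1, 0.3, 0.9],
    [0.1, 0.0, 0.2, 0.9],
    [0.3, 0.2, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
cluster_labels = butina_sketch(distances, cutoff=0.4)
```

The first three items are mutual neighbours and share one label, while the distant fourth item forms its own cluster.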

class optunaz.utils.preprocessing.splitter.ScaffoldSplit(bins='fd_merge', random_state=42, make_scaffold_generic=True, butina_cluster=0.4, name='ScaffoldSplit')[source]

Bases: GroupingSplitter

Stratified Group K Fold based on chemical scaffold.

Parameters:
  • bins (str) – Algorithm to use for determining histogram bin edges; see numpy.histogram for possible options, or use the default ‘fd_merge’ - title: Binning algorithm

  • random_state (Optional) –

  • make_scaffold_generic (bool) – Makes Murcko scaffolds generic by removing hetero-atoms - title: Make scaffold generic

  • butina_cluster (float) – Butina clustering to aggregate scaffolds into shared folds. Elements within this cluster range are considered neighbors, increasing test difficulty. 0.0 turns Butina clustering off - minimum: 0.0, maximum: 1.0, title: Cluster threshold

  • name (Literal) –

Splits data based on chemical (Murcko) scaffolds for the compounds in the user input data. This emulates the real-world scenario when models are applied to novel chemical space.

bins = 'fd_merge'
random_state = 42
make_scaffold_generic = True
butina_cluster = 0.4
name = 'ScaffoldSplit'
get_sklearn_splitter(n_splits)[source]
get_n_splits(X=None, y=None, groups=None)[source]
split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

groups(df, smiles_col)[source]

Calculate scaffold SMILES from a SMILES column.
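
Once each compound has a scaffold label, folds can be built so that no scaffold is shared between folds. The sketch below assumes the scaffold SMILES are already computed and ignores the stratification on the response; group_kfold_sketch and its greedy balancing rule are hypothetical.

```python
from collections import defaultdict


def group_kfold_sketch(groups, n_splits):
    """Assign whole groups to folds; return a list of index lists."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    folds = [[] for _ in range(n_splits)]
    # Largest groups first, each placed into the currently smallest fold.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[group])
    return [sorted(fold) for fold in folds]


scaffolds = [
    "c1ccccc1", "c1ccccc1", "C1CCCCC1",
    "c1ccncc1", "c1ccncc1", "c1ccccc1",
]
scaffold_folds = group_kfold_sketch(scaffolds, n_splits=2)
```

Because whole groups are moved together, every compound with the same scaffold ends up in the same fold.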

optunaz.utils.preprocessing.transform module

exception optunaz.utils.preprocessing.transform.DataTransformError[source]

Bases: Exception

Raised when there are insufficient molecules for UnfittedSklearnScaler to fit.

class optunaz.utils.preprocessing.transform.DataTransform[source]

Bases: NameParameterDataclass, ABC

Base class for auxiliary transformers.

Each data transformer should provide a transform method, which takes raw input data, and returns numpy arrays with transformed output data.

abstract transform(y_)[source]
class optunaz.utils.preprocessing.transform.PTRTransform(name='PTRTransform', parameters=PTRTransform.Parameters(threshold=None, std=None))[source]

Bases: DataTransform

Transform model input/output with PTR

class Parameters(threshold=None, std=None)[source]

Bases: object

Parameters:
  • threshold (float) – The decision boundary for discretising active or inactive classes used by PTR. - title: PTR Threshold

  • std (float) – The standard deviation used by PTR, e.g. experimental reproducibility/uncertainty - title: PTR standard deviation

threshold = None
std = None
name = 'PTRTransform'
parameters = PTRTransform.Parameters(threshold=None, std=None)
transform(y_)[source]
reverse_transform(y_)[source]
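
How threshold and std might enter the transform is sketched below using only the standard library. This is an assumption, not the confirmed formula: it treats the transformed value as the probability that a measurement y with Gaussian error std lies above the threshold, and both function names are hypothetical.

```python
from statistics import NormalDist


def ptr_transform_sketch(y, threshold, std):
    """Map a raw value to a probability of exceeding the threshold."""
    # Assumption: Gaussian cumulative distribution centred on the threshold.
    return NormalDist(mu=threshold, sigma=std).cdf(y)


def ptr_reverse_sketch(p, threshold, std):
    """Map a probability (strictly between 0 and 1) back to the raw scale."""
    return NormalDist(mu=threshold, sigma=std).inv_cdf(p)


at_threshold = ptr_transform_sketch(6.0, threshold=6.0, std=0.5)
above = ptr_transform_sketch(7.0, threshold=6.0, std=0.5)
round_trip = ptr_reverse_sketch(above, threshold=6.0, std=0.5)
```

Under this assumption a value exactly at the threshold maps to 0.5, and values further above the threshold approach 1.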
class optunaz.utils.preprocessing.transform.LogBase(value)[source]

Bases: str, Enum

Base for Numpy transform in ModelDataTransform

LOG2 = 'log2'
LOG10 = 'log10'
LOG = 'log'
class optunaz.utils.preprocessing.transform.LogNegative(value)[source]

Bases: str, Enum

Whether the Numpy log transform is negated

TRUE = 'True'
FALSE = 'False'
class optunaz.utils.preprocessing.transform.ModelDataTransform(name='ModelDataTransform', parameters=ModelDataTransform.Parameters(base=None, negation=None, conversion=None))[source]

Bases: DataTransform

Data transformer that applies and reverses logarithmic functions to user data

class Parameters(base=None, negation=None, conversion=None)[source]

Bases: object

Parameters:
  • base (LogBase) – The log, log2 or log10 base to use in log transformation - title: Base

  • negation (LogNegative) – Whether to negate (-) the log transform - title: Negation

  • conversion (Optional) – The conversion power applied in the log transformation - title: Conversion power

base = None
negation = None
conversion = None
name = 'ModelDataTransform'
parameters = ModelDataTransform.Parameters(base=None, negation=None, conversion=None)
base_dict = {LogBase.LOG: <ufunc 'log'>, LogBase.LOG10: <ufunc 'log10'>, LogBase.LOG2: <ufunc 'log2'>}
base_negation = {LogNegative.FALSE: False, LogNegative.TRUE: True}
reverse_dict = {LogBase.LOG: <ufunc 'exp'>, LogBase.LOG10: <function ModelDataTransform.<lambda>>, LogBase.LOG2: <function ModelDataTransform.<lambda>>}
transform_df(df)[source]
transform_one(value)[source]
reverse_transform_df(df)[source]
reverse_transform_one(value)[source]
transform(y_)[source]
reverse_transform(y_)[source]
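
The forward and reverse mappings listed in base_dict and reverse_dict can be sketched as below. The dictionaries FORWARD and REVERSE and both functions are hypothetical, and the conversion parameter is left out because its exact effect is not described here.

```python
import numpy as np

# Forward ufuncs and their inverses, keyed by base name.
FORWARD = {"log": np.log, "log2": np.log2, "log10": np.log10}
REVERSE = {
    "log": np.exp,
    "log2": lambda x: 2.0 ** x,
    "log10": lambda x: 10.0 ** x,
}


def log_transform_sketch(y, base="log10", negation=False):
    """Apply the chosen logarithm, optionally negated."""
    out = FORWARD[base](np.asarray(y, dtype=float))
    return -out if negation else out


def log_reverse_sketch(y, base="log10", negation=False):
    """Undo log_transform_sketch for the same base and negation."""
    y = np.asarray(y, dtype=float)
    return REVERSE[base](-y if negation else y)


raw = np.array([1e-6, 1e-9])
transformed = log_transform_sketch(raw, base="log10", negation=True)
restored = log_reverse_sketch(transformed, base="log10", negation=True)
```

A negated log10 of molar concentrations gives the familiar p-scale, so 1e-6 maps to 6 and 1e-9 maps to 9.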
class optunaz.utils.preprocessing.transform.AuxTransformer[source]

Bases: DataTransform

Base class for Auxiliary transformation classes

Each auxiliary data transformation provides the method transform, which takes raw auxiliary data, and returns numpy arrays with transformed auxiliary data.

abstract transform(auxiliary_data)[source]
class optunaz.utils.preprocessing.transform.VectorFromColumn(name='VectorFromColumn', parameters=VectorFromColumn.Parameters(delimiter=','))[source]

Bases: AuxTransformer

Vector from column

Splits delimited values from inputs into usable vectors.

class Parameters(delimiter=',')[source]

Bases: object

Parameters:

delimiter (str) – String used to split the auxiliary column into a vector - title: Delimiter

delimiter = ','
name = 'VectorFromColumn'
parameters = VectorFromColumn.Parameters(delimiter=',')
transform(auxiliary_data)[source]
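
A minimal sketch of splitting a delimited column into a numeric matrix. The helper vector_from_column_sketch is hypothetical and assumes every row holds the same number of numeric values.

```python
import numpy as np


def vector_from_column_sketch(auxiliary_data, delimiter=","):
    """Turn strings such as "0.1,0.2" into rows of a float array."""
    return np.array(
        [[float(v) for v in str(row).split(delimiter)] for row in auxiliary_data]
    )


vectors = vector_from_column_sketch(["0.1,0.2,0.3", "1,2,3"])
```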
class optunaz.utils.preprocessing.transform.ZScales(name='ZScales', parameters=ZScales.Parameters())[source]

Bases: AuxTransformer

Z-scales from column

Calculates Z-scales for sequences or a predefined list of peptide/protein targets.

class Parameters[source]

Bases: object

name = 'ZScales'
parameters = ZScales.Parameters()
transform(auxiliary_data)[source]
class optunaz.utils.preprocessing.transform.AmorProt(name='AmorProt', parameters=AmorProt.Parameters())[source]

Bases: AuxTransformer

AmorProt from column

Calculates AmorProt for sequences or a predefined list of peptide/protein targets

class Parameters[source]

Bases: object

name = 'AmorProt'
parameters = AmorProt.Parameters()
transform(auxiliary_data)[source]

Module contents