optunaz.utils.preprocessing package
Submodules
optunaz.utils.preprocessing.deduplicator module
- class optunaz.utils.preprocessing.deduplicator.Deduplicator[source]
Bases:
object
Base class for deduplicators.
Each deduplicator should provide a method dedup, which takes a dataframe and the name of the SMILES column, and returns a dataframe with duplicates removed.
- class optunaz.utils.preprocessing.deduplicator.KeepFirst(name='KeepFirst')[source]
Bases:
Deduplicator
Keep first.
- name = 'KeepFirst'
- class optunaz.utils.preprocessing.deduplicator.KeepLast(name='KeepLast')[source]
Bases:
Deduplicator
Keep last.
- name = 'KeepLast'
- class optunaz.utils.preprocessing.deduplicator.KeepRandom(name='KeepRandom', seed=None)[source]
Bases:
Deduplicator
Keep random.
- name = 'KeepRandom'
- seed = None
- class optunaz.utils.preprocessing.deduplicator.KeepMin(name='KeepMin')[source]
Bases:
Deduplicator
Keep min.
- name = 'KeepMin'
- class optunaz.utils.preprocessing.deduplicator.KeepMax(name='KeepMax')[source]
Bases:
Deduplicator
Keep max.
- name = 'KeepMax'
- class optunaz.utils.preprocessing.deduplicator.KeepAvg(name='KeepAvg')[source]
Bases:
Deduplicator
Keep average. Classification will threshold at 0.5.
This deduplicator converts input SMILES to canonical SMILES.
- name = 'KeepAvg'
- class optunaz.utils.preprocessing.deduplicator.KeepMedian(name='KeepMedian')[source]
Bases:
Deduplicator
Keep median. Classification will threshold at 0.5.
This deduplicator converts input SMILES to canonical SMILES.
- name = 'KeepMedian'
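As a rough illustration of the strategies listed above, the sketch below deduplicates plain Python (SMILES, value) records rather than a dataframe. The helper name `dedup_records`, the record layout, and the `>= 0.5` rounding rule for classification are assumptions made for this example, not the optunaz implementation. Conversion to canonical SMILES is left out.

```python
from statistics import mean, median


def dedup_records(records, strategy="first", classification=False):
    """Collapse records that share a SMILES string (hypothetical helper).

    records: list of (smiles, value) tuples.
    strategy: one of "first", "last", "min", "max", "avg", "median".
    """
    reducers = {
        "first": lambda values: values[0],
        "last": lambda values: values[-1],
        "min": min,
        "max": max,
        "avg": mean,
        "median": median,
    }
    # Group values by SMILES, preserving first-seen order of the keys.
    grouped = {}
    for smiles, value in records:
        grouped.setdefault(smiles, []).append(value)

    result = []
    for smiles, values in grouped.items():
        reduced = reducers[strategy](values)
        # Assumption: aggregated class labels are thresholded at 0.5,
        # with values >= 0.5 mapped to class 1.
        if classification and strategy in ("avg", "median"):
            reduced = int(reduced >= 0.5)
        result.append((smiles, reduced))
    return result


data = [("CCO", 1.0), ("CCO", 3.0), ("c1ccccc1", 2.0), ("CCO", 5.0)]
first = dedup_records(data, "first")
average = dedup_records(data, "avg")
```

Keeping the reducers in a dictionary mirrors the one-class-per-strategy layout of the module while staying short.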
optunaz.utils.preprocessing.splitter module
- class optunaz.utils.preprocessing.splitter.SklearnSplitter[source]
Bases:
ABC
Interface definition for scikit-learn cross-validation splitter.
Scikit-learn does not define a class that describes the splitter interface. Instead, scikit-learn describes in text that a splitter should have two methods: ‘get_n_splits’ and ‘split’.
This class describes this splitter interface as an abstract Python class, for convenience and better type checking.
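A minimal sketch of what such an abstract interface could look like is shown below. The class names `SplitterInterface` and `HoldOutLast` are hypothetical and exist only for this example; the method signatures follow the scikit-learn splitter convention.

```python
from abc import ABC, abstractmethod


class SplitterInterface(ABC):
    """Hypothetical stand-in for a scikit-learn style splitter interface."""

    @abstractmethod
    def get_n_splits(self, X=None, y=None, groups=None):
        """Return the number of splitting iterations."""

    @abstractmethod
    def split(self, X, y=None, groups=None):
        """Yield (train_indices, test_indices) pairs."""


class HoldOutLast(SplitterInterface):
    """Toy splitter: a single split that holds out the last sample."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        indices = list(range(len(X)))
        yield indices[:-1], indices[-1:]


splits = list(HoldOutLast().split(["a", "b", "c"]))
```

Because both methods are abstract, a subclass that forgets one of them fails at instantiation time instead of deep inside a cross-validation run.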
- class optunaz.utils.preprocessing.splitter.Splitter[source]
Bases:
object
Splitter for input data.
This is the base class for classes that split input data into train and test.
See also CvSplitter for making multiple cross-validation splits.
Splitter and CvSplitter are used to define valid input choices for splitting data into train-test sets, and for splitting train data into cross-validation splits in scikit-learn cross_validate function. These two sets of options might be different (although underlying implementations might be merged).
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
- class optunaz.utils.preprocessing.splitter.Random(name='Random', fraction=0.2, seed=1, leave_out=0.0)[source]
Bases:
Splitter
Random split.
- Parameters:
name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
- name = 'Random'
- fraction = 0.2
- seed = 1
- leave_out = 0.0
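The interplay of fraction, seed and leave_out can be sketched with the standard library alone. The function `random_split` is hypothetical, and the order of operations (drop the left-out samples first, then take the test fraction of the remainder) is an assumption, not a statement about how optunaz does it.

```python
import random


def random_split(n_samples, fraction=0.2, seed=1, leave_out=0.0):
    """Return (train, test) index lists for one random split (hypothetical)."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(indices)

    # Assumption: left-out samples are dropped first, then `fraction` of the
    # remaining samples goes to the test set.
    n_kept = n_samples - int(n_samples * leave_out)
    kept = indices[:n_kept]
    n_test = int(round(len(kept) * fraction))
    return sorted(kept[n_test:]), sorted(kept[:n_test])


train, test = random_split(100, fraction=0.2, seed=1, leave_out=0.5)
```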
- class optunaz.utils.preprocessing.splitter.Temporal(name='Temporal', fraction=0.2)[source]
Bases:
Splitter
Temporal split.
Assumes that the data is sorted, with the oldest entries at the beginning of the file, and the newest entries added at the end.
- name = 'Temporal'
- fraction = 0.2
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
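For data sorted from oldest to newest, a temporal split can be sketched as below. The function `temporal_split` is hypothetical, and it assumes that the test set is the final `fraction` of the rows (the newest entries); the source does not spell out the exact rounding.

```python
import numpy as np


def temporal_split(n_samples, fraction=0.2):
    """Return (train, test) index arrays, holding out the newest rows."""
    # Assumption: the test set is the last `fraction` of the rows.
    n_test = int(round(n_samples * fraction))
    train = np.arange(0, n_samples - n_test)
    test = np.arange(n_samples - n_test, n_samples)
    return train, test


train_idx, test_idx = temporal_split(10, fraction=0.2)
```

No shuffling takes place, so every test index is later than every train index.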
- class optunaz.utils.preprocessing.splitter.Stratified(name='Stratified', fraction=0.2, seed=1, leave_out=0.0, bins='fd_merge')[source]
Bases:
Splitter
Real-valued Stratified Shuffle Split.
- Parameters:
name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
bins (str) – Algorithm to use for determining histogram bin edges, see numpy.histogram for possible options, or use default ‘fd’ - title: Binning algorithm
This is similar to scikit-learn StratifiedShuffleSplit, but uses histogram binning for real-valued inputs.
If inputs are integers (or strings), this splitter reverts to StratifiedShuffleSplit.
- name = 'Stratified'
- fraction = 0.2
- seed = 1
- leave_out = 0.0
- bins = 'fd_merge'
- class optunaz.utils.preprocessing.splitter.NoSplitting(name='NoSplitting')[source]
Bases:
Splitter
No splitting.
Do not perform any splitting. Returns all input data as training set, and returns an empty test set.
- name = 'NoSplitting'
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
- class optunaz.utils.preprocessing.splitter.KFold(name='KFold', shuffle=True, random_state=None)[source]
Bases:
Splitter
KFold.
- Parameters:
name (Literal) –
shuffle (bool) – Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled. - title: Shuffle
random_state (Optional) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. - title: Random state
Split dataset into k consecutive folds. Note that shuffle defaults to True in this class, unlike scikit-learn's KFold, which does not shuffle by default.
Each fold is then used once as a validation set, while the k - 1 remaining folds form the training set.
- name = 'KFold'
- shuffle = True
- random_state = None
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
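The idea of k consecutive folds, each used once for validation, can be sketched without shuffling in pure Python. The generator `consecutive_folds` is hypothetical and yields every fold for illustration, whereas the split method documented above returns a single split.

```python
def consecutive_folds(n_samples, k):
    """Yield (train, test) index lists for k consecutive, unshuffled folds."""
    # The first n_samples % k folds get one extra sample, as in scikit-learn.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        # Everything outside the current fold forms the training set.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(consecutive_folds(5, 2))
```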
- optunaz.utils.preprocessing.splitter.fd_bin(y)[source]
Histogram binning with empty-bin merging, based on: https://github.com/numpy/numpy/issues/11879 and https://github.com/numpy/numpy/issues/10297
The modification avoids the problems described in those issues by merging adjacent empty bins.
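One possible way to merge empty bins, given precomputed edges and counts, is sketched below. The helper `merge_empty_bins` is hypothetical and only illustrates the idea; it is not taken from the optunaz source.

```python
def merge_empty_bins(edges, counts):
    """Drop edges so that no resulting bin is empty (hypothetical helper).

    edges: ascending bin edges, len(counts) + 1 of them.
    counts: number of samples in each bin.
    """
    kept = [edges[0]]
    for right_edge, count in zip(edges[1:], counts):
        # Keeping the right edge closes a bin; skipping it merges the
        # empty bin into the next non-empty one.
        if count > 0:
            kept.append(right_edge)
    # Trailing empty bins are absorbed by stretching the last kept bin.
    if len(kept) > 1:
        kept[-1] = edges[-1]
    return kept


merged = merge_empty_bins([0, 1, 2, 3, 4], [2, 0, 0, 1])
```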
- optunaz.utils.preprocessing.splitter.stratify(y, bins='fd')[source]
Stratifies (splits into groups) the values in ‘y’.
If input ‘y’ is real-valued (numpy.dtype.kind == ‘f’), this function bins the values based on computed histogram edges.
For all other types of inputs, this function returns the original array, since downstream algorithms can natively deal with integer and categorical data.
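The behaviour described above can be approximated with NumPy as follows. The function `stratify_sketch` is a hypothetical re-implementation for illustration; it uses `numpy.histogram_bin_edges` and `numpy.digitize`, and does not reproduce the 'fd_merge' option.

```python
import numpy as np


def stratify_sketch(y, bins="fd"):
    """Return group labels for `y` (hypothetical re-implementation)."""
    y = np.asarray(y)
    if y.dtype.kind != "f":
        # Integer and categorical inputs are returned unchanged.
        return y
    edges = np.histogram_bin_edges(y, bins=bins)
    # Use only the interior edges so that the maximum falls in the last bin.
    return np.digitize(y, edges[1:-1])


labels = stratify_sketch([0.1, 0.2, 0.3, 5.0, 5.1, 5.2], bins=2)
classes = stratify_sketch([0, 1, 1, 0])
```

The resulting integer labels can then be passed to a stratified splitter as if they were class labels.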
- class optunaz.utils.preprocessing.splitter.HistogramStratifiedShuffleSplit(test_fraction=0.1, n_splits=10, bins='fd_merge', random_state=42, train_size=0.0)[source]
Bases:
SklearnSplitter
StratifiedShuffleSplit for real-valued inputs.
- test_fraction = 0.1
- n_splits = 10
- bins = 'fd_merge'
- random_state = 42
- train_size = 0.0
- class optunaz.utils.preprocessing.splitter.GroupingSplitter[source]
Bases:
Splitter, ABC
Splitter for methods that split data by group.
This is the base class for the Predefined and ScaffoldSplit classes.
- class optunaz.utils.preprocessing.splitter.Predefined(column_name=None, name='Predefined')[source]
Bases:
GroupingSplitter
Predefined split.
- Parameters:
column_name (str) – Name of the column with labels for splits. Use -1 to denote datapoints for the train set - title: Column Name
name (Literal) –
Splits data based on predefined labels in a column. Integers can be used, and -1 flags datapoints for use only in the training set. Data points with missing (NaN) values will be removed from train or test.
- column_name = None
- name = 'Predefined'
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
- optunaz.utils.preprocessing.splitter.butina_cluster(groups, cutoff=0.4)[source]
Clusters the scaffolds using Butina clustering and returns the scaffold grouping labels.
- class optunaz.utils.preprocessing.splitter.ScaffoldSplit(bins='fd_merge', random_state=42, make_scaffold_generic=True, butina_cluster=0.4, name='ScaffoldSplit')[source]
Bases:
GroupingSplitter
Stratified Group K Fold based on chemical scaffold.
- Parameters:
bins (str) – Algorithm to use for determining histogram bin edges, see numpy.histogram for possible options, or use default ‘fd’ - title: Binning algorithm
random_state (Optional) –
make_scaffold_generic (bool) – Makes Murcko scaffolds generic by removing hetero-atoms - title: Make scaffold generic
butina_cluster (float) – Butina clustering to aggregate scaffolds into shared folds. Elements within this cluster range are considered neighbors, increasing test difficulty. 0.0 turns Butina clustering off - minimum: 0.0, maximum: 1.0, title: Cluster threshold
name (Literal) –
Splits data based on chemical (Murcko) scaffolds for the compounds in the user input data. This emulates the real-world scenario in which models are applied to novel chemical space.
- bins = 'fd_merge'
- random_state = 42
- make_scaffold_generic = True
- butina_cluster = 0.4
- name = 'ScaffoldSplit'
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits, whereas this method returns only one split.
optunaz.utils.preprocessing.transform module
- exception optunaz.utils.preprocessing.transform.DataTransformError[source]
Bases:
Exception
Raised when there are insufficient molecules for UnfittedSklearnScaler to fit.
- class optunaz.utils.preprocessing.transform.DataTransform[source]
Bases:
NameParameterDataclass, ABC
Base class for auxiliary transformers.
Each data transformer should provide a method transform, which takes raw input data, and returns numpy arrays with transformed output data.
- class optunaz.utils.preprocessing.transform.PTRTransform(name='PTRTransform', parameters=PTRTransform.Parameters(threshold=None, std=None))[source]
Bases:
DataTransform
Transform model input/output with PTR
- class Parameters(threshold=None, std=None)[source]
Bases:
object
- Parameters:
threshold (float) – The decision boundary for discretising active or inactive classes used by PTR. - title: PTR Threshold
std (float) – The standard deviation used by PTR, e.g. experimental reproducibility/uncertainty - title: PTR standard deviation
- threshold = None
- std = None
- name = 'PTRTransform'
- parameters = PTRTransform.Parameters(threshold=None, std=None)
- class optunaz.utils.preprocessing.transform.LogBase(value)[source]
Bases:
str, Enum
Base for Numpy transform in ModelDataTransform
- LOG2 = 'log2'
- LOG10 = 'log10'
- LOG = 'log'
- class optunaz.utils.preprocessing.transform.LogNegative(value)[source]
Bases:
str, Enum
Whether the Numpy log transform in ModelDataTransform is negated
- TRUE = 'True'
- FALSE = 'False'
- class optunaz.utils.preprocessing.transform.ModelDataTransform(name='ModelDataTransform', parameters=ModelDataTransform.Parameters(base=None, negation=None, conversion=None))[source]
Bases:
DataTransform
Data transformer that applies and reverses logarithmic functions to user data
- class Parameters(base=None, negation=None, conversion=None)[source]
Bases:
object
- Parameters:
base (LogBase) – The log, log2 or log10 base to use in log transformation - title: Base
negation (LogNegative) – Whether to negate (-) the log transform - title: Negation
conversion (Optional) – The conversion power applied in the log transformation - title: Conversion power
- base = None
- negation = None
- conversion = None
- name = 'ModelDataTransform'
- parameters = ModelDataTransform.Parameters(base=None, negation=None, conversion=None)
- base_dict = {LogBase.LOG: <ufunc 'log'>, LogBase.LOG10: <ufunc 'log10'>, LogBase.LOG2: <ufunc 'log2'>}
- base_negation = {LogNegative.FALSE: False, LogNegative.TRUE: True}
- reverse_dict = {LogBase.LOG: <ufunc 'exp'>, LogBase.LOG10: <function ModelDataTransform.<lambda>>, LogBase.LOG2: <function ModelDataTransform.<lambda>>}
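The forward and reverse mappings listed above can be illustrated with the standard math module. The function names below are hypothetical, the lambdas for log10 and log2 are assumed to be powers of 10 and 2, and the conversion parameter is omitted because its exact semantics are not stated here.

```python
import math

# Forward and reverse functions per base, mirroring the documented mapping
# (exp for log; powers of 10 and 2 are assumed for log10 and log2).
FORWARD = {"log": math.log, "log10": math.log10, "log2": math.log2}
REVERSE = {
    "log": math.exp,
    "log10": lambda x: 10.0 ** x,
    "log2": lambda x: 2.0 ** x,
}


def log_transform(value, base="log10", negation=False):
    """Apply the chosen logarithm, optionally flipping the sign."""
    transformed = FORWARD[base](value)
    return -transformed if negation else transformed


def reverse_log_transform(value, base="log10", negation=False):
    """Invert log_transform for the same base and negation setting."""
    # Undo the sign flip first, then invert the logarithm.
    restored = -value if negation else value
    return REVERSE[base](restored)


p_value = log_transform(1e-6, base="log10", negation=True)
round_trip = reverse_log_transform(p_value, base="log10", negation=True)
```

A negated log10 of this kind is the familiar "p" scale, which is why the reverse step has to undo the sign before exponentiating.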
- class optunaz.utils.preprocessing.transform.AuxTransformer[source]
Bases:
DataTransform
Base class for Auxiliary transformation classes
Each auxiliary data transformation provides the method transform, which takes raw auxiliary data, and returns numpy arrays with transformed auxiliary data.
- class optunaz.utils.preprocessing.transform.VectorFromColumn(name='VectorFromColumn', parameters=VectorFromColumn.Parameters(delimiter=','))[source]
Bases:
AuxTransformer
Vector from column
Splits delimited values in the inputs into usable vectors
- class Parameters(delimiter=',')[source]
Bases:
object
- Parameters:
delimiter (str) – String used to split the auxiliary column into a vector - title: Delimiter
- delimiter = ','
- name = 'VectorFromColumn'
- parameters = VectorFromColumn.Parameters(delimiter=',')
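The splitting step can be sketched in a few lines of plain Python. The function `vector_from_column` is hypothetical, and it assumes that every delimited token parses as a number.

```python
def vector_from_column(column, delimiter=","):
    """Split each delimited string into a list of floats (hypothetical)."""
    # Assumption: every delimited token parses as a number.
    return [[float(token) for token in cell.split(delimiter)] for cell in column]


vectors = vector_from_column(["0.1,0.2,0.3", "1,2,3"])
```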
- class optunaz.utils.preprocessing.transform.ZScales(name='ZScales', parameters=ZScales.Parameters())[source]
Bases:
AuxTransformer
Z-scales from column
Calculates Z-scales for sequences or a predefined list of peptide/protein targets
- name = 'ZScales'
- parameters = ZScales.Parameters()
- class optunaz.utils.preprocessing.transform.AmorProt(name='AmorProt', parameters=AmorProt.Parameters())[source]
Bases:
AuxTransformer
AmorProt from column
Calculates AmorProt for sequences or a predefined list of peptide/protein targets
- name = 'AmorProt'
- parameters = AmorProt.Parameters()