Available splitters

Random

class optunaz.utils.preprocessing.splitter.Random(name='Random', fraction=0.2, seed=1, leave_out=0.0)[source]

Random split.

Parameters:

name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
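
To make the roles of fraction, seed and leave_out concrete, here is a minimal standard-library sketch of a single shuffled split. It does not call optunaz. The function name random_split_indices is hypothetical, and it assumes that fraction and leave_out are both taken relative to the full dataset, which the library may define differently.

```python
import random


def random_split_indices(n_samples, fraction=0.2, seed=1, leave_out=0.0):
    """Return (train, test) index lists for a single shuffled split.

    Assumption: `fraction` and `leave_out` are both taken relative to the
    full dataset; the library may define them differently.
    """
    if fraction + leave_out >= 1.0:
        raise ValueError("fraction + leave_out must leave room for a train set")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_test = round(n_samples * fraction)
    n_left_out = round(n_samples * leave_out)
    test = sorted(indices[:n_test])
    # The next n_left_out shuffled indices are discarded entirely.
    train = sorted(indices[n_test + n_left_out:])
    return train, test


train, test = random_split_indices(100, fraction=0.2, seed=1, leave_out=0.1)
print(len(train), len(test))  # 70 20
```

Calling the function twice with the same seed gives the same split, which is the point of the seed parameter.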

Temporal

class optunaz.utils.preprocessing.splitter.Temporal(name='Temporal', fraction=0.2)[source]

Temporal split.

Assumes that the data is sorted chronologically, with the oldest entries at the beginning of the file and the newest entries at the end.

split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.
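
The idea behind a temporal split can be sketched without the library: with rows ordered oldest first, the last fraction of rows becomes the test set. The function name temporal_split_indices is hypothetical, it returns plain lists where the library returns numpy arrays, and the rounding of the test-set size is a choice made for this sketch.

```python
def temporal_split_indices(n_samples, fraction=0.2):
    """Return (train, test) index lists; the newest rows form the test set.

    Assumes rows are ordered oldest first, as the class description requires.
    Rounding of the test-set size is a choice made for this sketch.
    """
    n_test = round(n_samples * fraction)
    cut = n_samples - n_test
    train = list(range(cut))            # oldest entries
    test = list(range(cut, n_samples))  # newest entries
    return train, test


train, test = temporal_split_indices(10, fraction=0.2)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```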

Stratified

class optunaz.utils.preprocessing.splitter.Stratified(name='Stratified', fraction=0.2, seed=1, leave_out=0.0, bins='fd_merge')[source]

Real-valued Stratified Shuffle Split.

Parameters:

name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
bins (str) – Algorithm to use for determining histogram bin edges; see numpy.histogram for possible options, or use the default ‘fd_merge’ shown in the signature - title: Binning algorithm

This is similar to scikit-learn StratifiedShuffleSplit, but uses histogram binning for real-valued inputs.

If inputs are integers (or strings), this splitter reverts to StratifiedShuffleSplit.
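
The binning idea can be illustrated with a small standard-library sketch: real-valued responses are assigned to histogram bins, and the test fraction is then drawn from each bin separately. This is not the library's implementation. The function name stratified_split_indices and the n_bins parameter are hypothetical, and equal-width bins stand in for the numpy.histogram bin-edge algorithms.

```python
import random
from collections import defaultdict


def stratified_split_indices(y, fraction=0.2, seed=1, n_bins=4):
    """Return (train, test) index lists, stratified on binned real values.

    Assumption: `n_bins` equal-width bins replace the numpy.histogram
    bin-edge algorithms that the real splitter uses.
    """
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all y are equal
    by_bin = defaultdict(list)
    for i, value in enumerate(y):
        # Clamp so that the maximum value falls into the last bin.
        by_bin[min(int((value - lo) / width), n_bins - 1)].append(i)

    rng = random.Random(seed)
    train, test = [], []
    for bin_id in sorted(by_bin):
        members = by_bin[bin_id]
        rng.shuffle(members)
        n_test = round(len(members) * fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# 40 evenly spaced values; with 4 equal-width bins each bin holds 10 samples.
y = [i / 10 for i in range(40)]
train, test = stratified_split_indices(y, fraction=0.2)
print(len(train), len(test))  # 32 8
```

Because the test samples are drawn per bin, low and high response values are both represented in the test set in proportion to their share of the data.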

Predefined

class optunaz.utils.preprocessing.splitter.Predefined(column_name=None, name='Predefined')[source]

Predefined split.

Parameters:

column_name (str) – Name of the column with labels for splits. Use -1 to mark data points for the train set - title: Column Name
name (Literal) –

Splits data based on predefined labels in a column. Integer labels can be used, and -1 flags data points for use only in the training set. Data points with missing (NaN) labels are removed from both the train and test sets.

split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.
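
A small sketch of how a label column could drive such a split, using only the standard library. The function name predefined_split_indices is hypothetical. It assumes that -1 means train, that any other non-missing label means test, and that rows with a missing label are dropped. How the library treats several distinct non-negative labels is not stated here, so the example uses only -1 and 0.

```python
import math


def predefined_split_indices(labels):
    """Return (train, test) index lists from a column of split labels.

    Assumptions for this sketch: -1 means train, any other non-missing
    label means test, and missing labels (None or NaN) are dropped.
    """
    train, test = [], []
    for i, label in enumerate(labels):
        if label is None or (isinstance(label, float) and math.isnan(label)):
            continue  # missing label: row is used in neither set
        if label == -1:
            train.append(i)
        else:
            test.append(i)
    return train, test


labels = [-1, -1, 0, float("nan"), 0, -1]
train, test = predefined_split_indices(labels)
print(train)  # [0, 1, 5]
print(test)   # [2, 4]
```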

ScaffoldSplit

class optunaz.utils.preprocessing.splitter.ScaffoldSplit(bins='fd_merge', random_state=42, make_scaffold_generic=True, butina_cluster=0.4, name='ScaffoldSplit')[source]

Stratified Group K Fold based on chemical scaffold.

Parameters:

bins (str) – Algorithm to use for determining histogram bin edges; see numpy.histogram for possible options, or use the default ‘fd_merge’ shown in the signature - title: Binning algorithm
random_state (Optional) –
make_scaffold_generic (bool) – Makes Murcko scaffolds generic by removing hetero-atoms - title: Make scaffold generic
butina_cluster (float) – Butina clustering to aggregate scaffolds into shared folds. Elements within this cluster range are considered neighbors, increasing test difficulty. 0.0 turns Butina clustering off - minimum: 0.0, maximum: 1.0, title: Cluster threshold
name (Literal) –

Splits data based on the chemical (Murcko) scaffolds of the compounds in the user input data. This emulates the real-world scenario in which models are applied to novel chemical space.

split(X, y=None, groups=None)[source]

Splits input and returns indices for train and test sets.

Returns two numpy arrays: one with indices of train set, and one with indices of test set.

Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.

groups(df, smiles_col)[source]

Calculates scaffold SMILES from a SMILES column.
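
The property that matters in a scaffold split is that no scaffold appears in both train and test. The sketch below shows that invariant with the standard library only, taking precomputed scaffold strings as input instead of deriving them with a cheminformatics toolkit. The function name group_split_indices and the test_fraction parameter are hypothetical. Stratification on the response and Butina clustering are left out.

```python
from collections import defaultdict


def group_split_indices(scaffolds, test_fraction=0.2):
    """Return (train, test) index lists with no scaffold shared between them.

    Assumptions: `scaffolds` holds one precomputed scaffold string per
    compound, and whole groups are moved to the test set, smallest first,
    while they still fit within `test_fraction` of the data.
    """
    by_scaffold = defaultdict(list)
    for i, scaffold in enumerate(scaffolds):
        by_scaffold[scaffold].append(i)

    target = round(len(scaffolds) * test_fraction)
    train, test = [], []
    # Smallest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(by_scaffold.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for scaffold, members in ordered:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1", "c1ccccc1",
    "C1CCCCC1", "c1ccccc1", "c1ccncc1", "c1ccccc1", "c1ccoc1",
]
train, test = group_split_indices(scaffolds, test_fraction=0.3)
print(test)   # [3, 5, 9]
print(train)  # [0, 1, 2, 4, 6, 7, 8]
```

Moving whole groups means the test set holds only scaffolds the model never saw during training, which is what makes this kind of split harder than a random one.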