Available splitters
Random
- class optunaz.utils.preprocessing.splitter.Random(name='Random', fraction=0.2, seed=1, leave_out=0.0)[source]
Random split.
- Parameters:
name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
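As an illustration of how the three numeric parameters interact, here is a minimal sketch of a seeded random split written with plain NumPy. The function `random_split` is hypothetical and not part of optunaz. It assumes that the `leave_out` fraction is discarded first and that `fraction` is then applied to the remaining samples; the library may order these steps differently.

```python
import numpy as np


def random_split(n_samples, fraction=0.2, seed=1, leave_out=0.0):
    """Hypothetical helper (not the optunaz implementation)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)

    # Assumption: the left-out samples are dropped before the test set is drawn.
    n_kept = n_samples - int(round(leave_out * n_samples))
    kept = shuffled[:n_kept]

    # The first `fraction` of the kept samples becomes the test set.
    n_test = int(round(fraction * n_kept))
    test = np.sort(kept[:n_test])
    train = np.sort(kept[n_test:])
    return train, test


train_idx, test_idx = random_split(100, fraction=0.2, seed=1, leave_out=0.1)
```

Calling the function twice with the same `seed` gives identical index arrays, which is the point of exposing the seed as a parameter.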
Temporal
- class optunaz.utils.preprocessing.splitter.Temporal(name='Temporal', fraction=0.2)[source]
Temporal split.
Assumes that the data is sorted, with the oldest entries at the beginning of the file and the newest entries added at the end.
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.
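The single-split return shape can be sketched as follows. The helper `temporal_split` is hypothetical and only illustrates the idea. It assumes that the newest `fraction` of rows (those at the end of the sorted data) forms the test set.

```python
import numpy as np


def temporal_split(n_samples, fraction=0.2):
    """Hypothetical helper (not the optunaz implementation)."""
    n_test = int(round(fraction * n_samples))
    indices = np.arange(n_samples)
    # Assumption: oldest rows come first, so the tail of the data is the test set.
    train = indices[: n_samples - n_test]
    test = indices[n_samples - n_test:]
    return train, test


# One (train, test) pair is returned directly, not an iterator of pairs.
train_idx, test_idx = temporal_split(10, fraction=0.2)
```

No shuffling takes place, so every training index precedes every test index.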
Stratified
- class optunaz.utils.preprocessing.splitter.Stratified(name='Stratified', fraction=0.2, seed=1, leave_out=0.0, bins='fd_merge')[source]
Real-valued Stratified Shuffle Split.
- Parameters:
name (Literal) –
fraction (float) – Fraction of samples to use for test set. - minimum: 0.0, maximum: 0.999, title: Fraction samples
seed (Optional) – Seed for random number generator, for repeatable splits. - title: Seed for random number generator
leave_out (Optional) – Fraction of samples that will not be used in train or test set, to reduce compute time. - minimum: 0.0, maximum: 0.999, title: Leave out fraction
bins (str) – Algorithm to use for determining histogram bin edges, see numpy.histogram for possible options, or use default ‘fd’ - title: Binning algorithm
This is similar to scikit-learn StratifiedShuffleSplit, but uses histogram binning for real-valued inputs.
If inputs are integers (or strings), this splitter reverts to StratifiedShuffleSplit.
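The idea of stratifying on a real-valued target can be sketched by binning the values with a histogram and sampling the test set separately from each bin. The helper `binned_stratified_split` below is hypothetical. It uses NumPy's `'fd'` (Freedman-Diaconis) bin estimator, not the library's `'fd_merge'` default, whose merging behaviour is not reproduced here.

```python
import numpy as np


def binned_stratified_split(y, fraction=0.2, seed=1, bins="fd"):
    """Hypothetical helper (not the optunaz implementation)."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)

    # Bin edges chosen by one of numpy.histogram's estimators.
    edges = np.histogram_bin_edges(y, bins=bins)
    # Digitizing against the interior edges labels each sample 0..n_bins-1.
    labels = np.digitize(y, edges[1:-1])

    train, test = [], []
    for label in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == label))
        # Draw the same fraction from every bin so both sets cover the range of y.
        n_test = int(round(fraction * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return np.sort(np.array(train, dtype=int)), np.sort(np.array(test, dtype=int))


y = np.random.default_rng(0).normal(size=200)
train_idx, test_idx = binned_stratified_split(y, fraction=0.2)
```

Because rounding happens per bin, the overall test fraction is only approximately equal to `fraction`, and sparsely populated bins may contribute no test samples at all.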
Predefined
- class optunaz.utils.preprocessing.splitter.Predefined(column_name=None, name='Predefined')[source]
Predefined split.
- Parameters:
column_name (str) – Name of the column with labels for splits. Use -1 to denote datapoints for the train set. - title: Column Name
name (Literal) –
Splits data based on predefined labels in a column. Integers can be used, and -1 flags datapoints for use only in the training set. Data points with missing (NaN) labels are removed from both the train and test sets.
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.
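A minimal sketch of label-driven splitting is shown below. The helper `predefined_split` is hypothetical. It assumes that rows labelled -1 always go to the train set, that rows with a NaN label are dropped, and that the smallest remaining label marks the test set, mirroring the first split of scikit-learn's PredefinedSplit. The library's actual handling of multiple labels may differ.

```python
import numpy as np


def predefined_split(labels):
    """Hypothetical helper (not the optunaz implementation)."""
    labels = np.asarray(labels, dtype=float)
    valid = ~np.isnan(labels)  # rows with a missing label are dropped

    # Assumption: the smallest label other than -1 identifies the test fold.
    folds = np.unique(labels[valid & (labels != -1)])
    test_mask = valid & (labels == folds[0])
    train_mask = valid & ~test_mask
    return np.flatnonzero(train_mask), np.flatnonzero(test_mask)


labels = [-1, 0, -1, np.nan, 0, -1]
train_idx, test_idx = predefined_split(labels)
```

The returned indices refer to positions in the original column, so the row with the missing label (index 3) appears in neither array.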
ScaffoldSplit
- class optunaz.utils.preprocessing.splitter.ScaffoldSplit(bins='fd_merge', random_state=42, make_scaffold_generic=True, butina_cluster=0.4, name='ScaffoldSplit')[source]
Stratified Group K Fold based on chemical scaffold.
- Parameters:
bins (str) – Algorithm to use for determining histogram bin edges, see numpy.histogram for possible options, or use default ‘fd’ - title: Binning algorithm
random_state (Optional) –
make_scaffold_generic (bool) – Makes Murcko scaffolds generic by removing hetero-atoms - title: Make scaffold generic
butina_cluster (float) – Butina clustering to aggregate scaffolds into shared folds. Elements within this cluster range are considered neighbors, increasing test difficulty. 0.0 turns Butina clustering off - minimum: 0.0, maximum: 1.0, title: Cluster threshold
name (Literal) –
Splits data based on chemical (Murcko) scaffolds for the compounds in the user input data. This emulates the real-world scenario where models are applied to novel chemical space.
- split(X, y=None, groups=None)[source]
Splits input and returns indices for train and test sets.
Returns two numpy arrays: one with indices of train set, and one with indices of test set.
Note that scikit-learn splitters return an Iterator that yields (train, test) tuples for multiple splits; here we return only one split.
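The core constraint of a scaffold split, that no scaffold appears in both train and test, can be sketched with a simple group-wise split. The helper `group_split` and its `test_fraction` parameter are hypothetical. The scaffold strings are assumed to be precomputed (for example with RDKit's Murcko scaffold utilities). The sketch omits stratification on the target, generic scaffolds, and Butina clustering, all of which the class above provides.

```python
import numpy as np


def group_split(groups, test_fraction=0.2, seed=42):
    """Hypothetical helper (not the optunaz implementation)."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    shuffled_groups = rng.permutation(np.unique(groups))

    # Move whole groups into the test set until the target size is reached.
    n_target = test_fraction * len(groups)
    test_groups, n_test = [], 0
    for group in shuffled_groups:
        if n_test >= n_target:
            break
        test_groups.append(group)
        n_test += int(np.sum(groups == group))

    test_mask = np.isin(groups, test_groups)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Illustrative scaffold SMILES, one per compound.
scaffolds = ["c1ccccc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1",
             "C1CCCCC1", "c1ccncc1", "c1ccccc1", "C1CCOC1"]
train_idx, test_idx = group_split(scaffolds)
```

Since groups are assigned whole, the test set can overshoot `test_fraction` when a large scaffold family is selected.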