Configuration file

The configuration file that is to be provided to the interfaces specify file paths to policy models and stocks, in addition to detailed parameters of the tree search.

Simple usage

Let’s say you have a

  • A trained Keras expansion model that is called uspto_expansion.onnx

  • A library of unique templates called uspto_templates.csv.gz

  • A stock file in HDF5 format, called zinc_stock.hdf5

(this could have been created by the smiles2stock tool, see here)

then to use them in the tree search, create a file called config.yml that has the following content

expansion:
  full:
    - uspto_expansion.onnx
    - uspto_templates.csv.gz
stock:
  zinc: zinc_stock.hdf5

Advanced usage

A more detailed configuration file is shown below

search:
  algorithm: mcts
  algorithm_config:
    C: 1.4
    default_prior: 0.5
    use_prior: True
    prune_cycles_in_search: True
    search_rewards:
      - state score
  max_transforms: 6
  iteration_limit: 100
  return_first: false
  time_limit: 120
  exclude_target_from_stock: True
  break_bonds: [[1, 2], [2, 3]]
  freeze_bonds: [[3, 4]]
  break_bonds_operator: or
expansion:
  my_policy:
    type: template-based
    model: /path/to/keras/model/weights.hdf5
    template: /path/to/hdf5/templates.hdf5
    template_column: retro_template
    cutoff_cumulative: 0.995
    cutoff_number: 50
    use_rdchiral: True
    use_remote_models: False
  my_full_policy:
    - /path/to/keras/model/weights.hdf5
    - /path/to/hdf5/templates.hdf5
filter:
  uspto:
    type: quick-filter
    model: /path/to/keras/model/weights.hdf5
    exclude_from_policy: rc
    filter_cutoff: 0.05
    use_remote_models: False
  uspto_full: /path/to/keras/model/weights.hdf5
stock:
  buyables:
    type: inchiset
    path: /path/to/stock1.hdf5
  emolecules: /path/to/stock1.hdf5
The (expansion) policy models are specified using two files
  • a checkpoint files from Keras in ONNX or hdf5 format,

  • a HDF5 or a CSV file containing templates.

A key like my_policy should be set and the configuration contains type, model and template that must be provided. If the other settings are not assigned, their default values are taken. The policy my_full_policy exemplifies a short-cut to the template-based expansion model when no other settings need to be provided, only the model and templates can be provided. The default settings will be taken in this case.

The template file should be readable by pandas using the table key and the retro_template column. A policy can then be selected using the provided key, like my_policy in the above example.

The filter policy model is specified using a single checkpoint file. Any key like uspto can be set. The settings contains type and model that must be provided. If the other settings are not assigned, their default values are taken. The filter uspto_full exemplifies a short-cut to the quick-filter model when no other settings need to be provided, only the model can be provided. The default settings will be taken in this case.

The stock files can be
  • HDF5 files with the table key an the inchi_key column.

  • A CSV file with a inchi_key column

  • A text file a single column

In all cases, the column should contain pre-computed inchi keys of the molecules. The stocks can be set using any key, like buyables or emolecules in the above example. The type and path parameters can also be set along with other parameters. If no other settings need to be provided, only the path can be provided, whereby it will be treated as a short-cut to the inchiset class..

The values in the search sections are optional, and if missing, default values are considered. These values can also be taken from environment variables. An example of this can be seen as below:

search:
  iteration_limit: ${ITERATION_LIMIT}
  algorithm_config:
    C: ${C}
  time_limit: ${TIME_LIMIT}
  max_transforms: ${MAX_TRANSFORMS}

These are the available search settings. The algorithm_config refers to MCTS settings:

Setting

Default value

Description

algorithm

mcts

The search algorithm. Can be set to package.module.ClassName to use a custom search method.

algorithm_config: C

1.4

The C value used to balance exploitation and exploration in the upper confidence bound score of the nodes.

algorithm_config: default_prior

0.5

The prior that is used if policy-provided priors are not used.

algorithm_config: use_prior

True

If True, priors from the policy is used instead of the default_prior.

algorithm_config: prune_cycles_in_search

True

If True, prevents the MCTS from creating cycles by recreating previously seen molecules when it is expanded.

algorithm_config: search_rewards

[state score]

The scoring used for the MCTS search algorithm.

algorithm_config: search_rewards_weights

[]

The scoring weights used by the Combined Scorer for the MCTS search algorithm.

algorithm_config: immediate_instantiation

[]

list of expansion policies for which the MCTS algorithm immediately instantiate the children node upon expansion

algorithm_config: mcts_grouping

if is partial or full the MCTS algorithm will group expansions that produce the same state. If partial is used the equality will only be determined based on the expandable molecules, whereas full will check all molecules.

max_transforms

6

The maximum depth of the search tree.

iteration_limit

100

The maximum number of iterations for the tree search.

time_limit

120

The maximum number of seconds to complete the tree search.

return_first

False

If True, the tree search will be terminated as soon as one solution is found.

exclude_target_from_stock

True

If True, the target is in stock will be broken down.

break_bonds

[]

The list of lists of atom numbers of molecular bonds pairs to break during the search.

freeze_bonds

[]

The list of lists of atom numbers of molecular bonds pairs to freeze or retain during the search.

break_bonds_operator

and

If set to ‘and’, all bond pairs listed in break_bonds must be broken. If set to ‘or’, breaking any listed bond pair in break_bonds is sufficient.

The post_processing settings are:

Setting

Default value

Description

min_routes

5

The minumum number of routes to extract if all_routes is not set.

max_routes

25

The maximum number of routes to extract if all_routes is not set.

all_routes

False

If True, will extract all solved routes.

route_distance_model

N/A

If set, will load the quick route distance model from this checkpoint file.

route_scorer

state score

The scoring for routes when extracting them in the post processing step.

The expansion settings are for template-based models:

Setting

Default value

Description

template_column

retro_template

The column in the template file that contains the templates.

cutoff_cumulative

0.995

The accumulative probability of the suggested templates is capped at this value. All other templates above this threshold are discarded.

cutoff_number

50

The maximum number of templates that will be returned from the expansion policy.

use_rdchiral

True

If True, will apply templates with RDChiral, otherwise RDKit will be used.

use_remote_models

False

If True, will try to connect to remote Tensorflow servers.

rescale_prior

False

If True, will apply a softmax function to the priors.

mask

“”

The path to a numpy .npz file containing a Boolean vector of masks for the reaction templates.

The filter settings are for quick-filter models:

Setting

Default value

Description

exclude_from_policy

[]

The list of names of the filter policies to exclude.

filter_cutoff

0.05

The cut-off for the quick-filter policy.

use_remote_models

False

If True, will try to connect to remote Tensorflow servers.