Open reaction database
The rxnutils package contains two pipelines that together import and prepare the reaction data from the Open reaction database so that it can be used for modelling.
It is a complete end-to-end pipeline that is designed to be transparent and reproducible.
The pipeline is divided into two blocks because the dependencies of the atom-mapper package (rxnmapper) are incompatible with the dependencies of the rxnutils package. Therefore, to use the full pipeline, you need to set up two Python environments.
Install rxnutils according to the instructions in the README file.
Install the ord-schema package in the ``rxnutils`` environment:
conda activate rxn-env
python -m pip install ord-schema
Download/clone the ORD data repository according to the instructions here:
Note down the path to the repository, as it needs to be given to the preparation pipeline.
Install rxnmapper according to the instructions in the repo:
conda create -n rxnmapper python=3.6 -y
conda activate rxnmapper
conda install -c rdkit rdkit=2020.03.3.0
python -m pip install rxnmapper
Install Metaflow and rxnutils (from the root of the rxnutils repository) in the new environment:
python -m pip install metaflow
python -m pip install --no-deps --ignore-requires-python .
Create a folder for the ORD data and, in that folder, execute this command in the rxnutils environment:
conda activate rxn-env
python -m run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH
and then this command in the rxnmapper environment:
conda activate rxnmapper
python -m run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200
The --max-workers flag should be set to the number of CPUs available.
On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
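A minimal sketch for picking a value for --max-workers is shown here. The helper name `available_cpus` is hypothetical and not part of rxnutils; it only uses the Python standard library and prefers the CPU affinity of the current process, where the platform exposes it, over the total CPU count of the machine.

```python
import os


def available_cpus() -> int:
    """Return the number of CPUs this process may use (at least 1)."""
    # os.sched_getaffinity exists only on some platforms (e.g. Linux) and
    # respects the CPU affinity of the process, such as a restricted cpuset.
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    # Fallback: logical CPU count of the machine; os.cpu_count() may be None.
    return os.cpu_count() or 1


if __name__ == "__main__":
    # Print the flag so it can be pasted into the pipeline command.
    print(f"--max-workers {available_cpus()}")
```

On a shared machine or a job scheduler, the number reported by the scheduler is probably the better choice, since affinity does not reflect every kind of resource limit.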
The pipelines create a number of tab-separated CSV files:
ord_data.csv is the imported ORD data
ord_data_cleaned.csv is the cleaned and filtered data
ord_data_mapped.csv is the atom-mapped, modelling-ready data
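As a quick sanity check after a run, the three files can be compared by row count to see how many reactions each stage kept. The following is a sketch using only the standard library; the helper names `count_rows` and `summarize` are hypothetical, and the only assumptions taken from the text are the file names and the tab delimiter.

```python
import csv
from pathlib import Path


def count_rows(path):
    """Count data rows (excluding the header) in a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line, if there is one
        return sum(1 for _ in reader)


def summarize(folder="."):
    """Return {filename: row count} for the pipeline outputs that exist."""
    names = ["ord_data.csv", "ord_data_cleaned.csv", "ord_data_mapped.csv"]
    return {
        name: count_rows(Path(folder) / name)
        for name in names
        if (Path(folder) / name).exists()
    }


if __name__ == "__main__":
    for name, rows in summarize().items():
        print(f"{name}: {rows} reactions")
```

Files that have not been produced yet are skipped, so the script can also be run between the two pipelines.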
The cleaning is done so that the reactions can be atom-mapped, and it performs the following tasks:
Ignore extended SMILES information in the SMILES strings
Remove molecules not sanitizable by RDKit
Remove reactions without any reactants or products
Move all reagents to reactants
Remove the existing atom-mapping
Remove reactions with more than 200 atoms when summing reactants and products
(the last is a requirement of rxnmapper, which was trained with a maximum token size roughly corresponding to 200 atoms)
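To illustrate what some of these tasks mean at the level of a reaction SMILES string, here is a rough string-based sketch. It is not the pipeline's implementation: the function name `clean_reaction_smiles` is hypothetical, the order of the steps simply follows the list above, and the steps that need RDKit (sanitization checks and atom counting) are left out entirely.

```python
import re

# Matches the atom-map suffix inside a bracket atom, e.g. ":12" in "[CH3:12]".
_ATOM_MAP = re.compile(r":\d+(?=\])")


def clean_reaction_smiles(rsmi):
    """Return a simplified 'reactants>>products' SMILES, or None if unusable."""
    # Ignore extended SMILES information: anything after the first whitespace.
    fields = rsmi.split()
    if not fields:
        return None
    parts = fields[0].split(">")
    if len(parts) != 3:
        return None  # not of the form reactants>reagents>products
    reactants, reagents, products = parts
    # Remove reactions without any reactants or products.
    if not reactants or not products:
        return None
    # Move all reagents to the reactants.
    if reagents:
        reactants = f"{reactants}.{reagents}"
    # Remove the existing atom-mapping numbers; the brackets are left as-is
    # (a cheminformatics toolkit would normally re-canonicalize the atoms).
    return _ATOM_MAP.sub("", f"{reactants}>>{products}")


if __name__ == "__main__":
    print(clean_reaction_smiles("[CH3:1][OH:2].[Na+]>O>[CH3:1][O-:2] |f:0.1|"))
```

In this sketch a reaction that cannot be used returns None, so a caller can filter it out in the same pass.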
The ord_data_mapped.csv file will have the following columns:
ID - unique ID from the original database
Dataset - the name of the dataset from which this reaction is taken
Date - the date of the experiment as given in the database
ReactionSmiles - the original reaction SMILES
Yield - the yield of the first product of the first outcome, if provided
ReactionSmilesClean - the reaction SMILES after cleaning
BadMolecules - molecules not sanitizable by RDKit
ReactantSize - number of atoms in reactants
ProductSize - number of atoms in products
mapped_rxn - the mapped reaction SMILES
confidence - the confidence of the mapping as provided by rxnmapper
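The columns above can be read with any tab-aware CSV reader. As a small sketch, the following selects the mapped reactions whose confidence is at or above a threshold. The function name `load_confident_reactions` and the default threshold of 0.5 are illustrative assumptions, not values recommended by the pipeline; only the column names ID, mapped_rxn and confidence come from the list above.

```python
import csv


def load_confident_reactions(path, min_confidence=0.5):
    """Yield (ID, mapped_rxn, confidence) for rows at or above the threshold."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                confidence = float(row["confidence"])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric confidence
            if confidence >= min_confidence:
                yield row["ID"], row["mapped_rxn"], confidence


if __name__ == "__main__":
    for rxn_id, mapped, conf in load_confident_reactions("ord_data_mapped.csv"):
        print(rxn_id, conf, mapped, sep="\t")
```

Because the function is a generator, the file is streamed row by row rather than loaded into memory at once.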