Open Reaction Database¶
rxnutils contains two pipelines that together import and prepare the reaction data from the Open Reaction Database so that it can be used for modelling.
Together they form a complete end-to-end pipeline that is designed to be transparent and reproducible.
Pre-requisites¶
The pipeline is divided into two blocks because the dependencies of the atom-mapper package (rxnmapper) are incompatible with the dependencies of the rxnutils package. Therefore, to use the full pipeline, you need to set up two Python environments.
Install rxnutils according to the instructions in the README file.
Install the ord-schema package in the rxnutils environment:
conda activate rxn-env
python -m pip install ord-schema
Download or clone the ord-data repository according to the instructions here: https://github.com/Open-Reaction-Database/ord-data
Note the path to the repository, as it must be given to the preparation pipeline.
Install rxnmapper according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper
conda create -n rxnmapper python=3.6 -y
conda activate rxnmapper
conda install -c rdkit rdkit=2020.03.3.0
python -m pip install rxnmapper
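Install Metaflow and rxnutils in the new environment (the second command installs from the current directory, so run it from the root of the rxnutils repository):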
python -m pip install metaflow
python -m pip install --no-deps --ignore-requires-python .
Usage¶
Create a folder for the ORD data and, in that folder, execute this command in the rxnutils environment:
conda activate rxn-env
python -m rxnutils.data.ord.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH
Then, in the same folder, run the mapping pipeline in the rxnmapper environment:
conda activate rxnmapper
python -m rxnutils.data.mapping_pipeline run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200
The --max-workers flag should be set to the number of CPUs available.
On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
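The two commands above share their batching flags, so it can be convenient to assemble them in one place. The following is a minimal sketch and not part of rxnutils: the helper name build_commands is hypothetical, it only builds the command strings shown above without running them, and it assumes that --max-num-splits should mirror --nbatches because both are 200 in the documented example.

```python
import os
import shlex


def build_commands(ord_data_repo, nbatches=200, max_workers=None):
    """Assemble the two documented command lines (hypothetical helper).

    Nothing is executed; each string still has to be run in its own
    conda environment (rxn-env and rxnmapper, respectively).
    """
    # Default to the number of CPUs reported by the OS (at least 1).
    workers = max_workers or os.cpu_count() or 1
    # Assumption: --max-num-splits mirrors --nbatches, as in the example.
    shared = [
        "--nbatches", str(nbatches),
        "--max-workers", str(workers),
        "--max-num-splits", str(nbatches),
    ]
    preparation = [
        "python", "-m", "rxnutils.data.ord.preparation_pipeline", "run",
        *shared, "--ord-data", str(ord_data_repo),
    ]
    mapping = [
        "python", "-m", "rxnutils.data.mapping_pipeline", "run",
        "--data-prefix", "ord", *shared,
    ]
    # Quote every argument so that paths with spaces survive the shell.
    quote = lambda args: " ".join(shlex.quote(arg) for arg in args)
    return quote(preparation), quote(mapping)


if __name__ == "__main__":
    prep_cmd, map_cmd = build_commands("ORD_DATA_REPO_PATH")
    print("In rxn-env:  ", prep_cmd)
    print("In rxnmapper:", map_cmd)
```

Building the strings rather than launching the processes keeps the sketch independent of which conda environment is currently active.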
Artifacts¶
The pipelines create a number of tab-separated CSV files:
ord_data.csv is the imported ORD data
ord_data_cleaned.csv is the cleaned and filtered data
ord_data_mapped.csv is the atom-mapped, modelling-ready data
The cleaning is done so that the reactions can be atom-mapped. It performs the following tasks:
Ignore extended SMILES information in the SMILES strings
Remove molecules not sanitizable by RDKit
Remove reactions without any reactants or products
Move all reagents to reactants
Remove the existing atom-mapping
Remove reactions with more than 200 atoms when summing reactants and products
(the last is a requirement of rxnmapper, which was trained with a maximum token size roughly corresponding to 200 atoms)
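To make some of the cleaning steps above concrete, here is a rough string-level sketch. It is an illustration only and not the code used by rxnutils: the function name clean_reaction_smiles is hypothetical, and the steps that need a chemistry toolkit (RDKit sanitization and the 200-atom size filter) are deliberately left out. It also assumes that ">" occurs only as the reaction separator, which does not hold for SMILES that contain dative bonds.

```python
import re

# An atom-map label such as ":12" sits directly before a closing bracket.
ATOM_MAP = re.compile(r":\d+(?=\])")


def clean_reaction_smiles(rsmi):
    """Return a cleaned reaction SMILES, or None if the reaction is dropped."""
    # Ignore extended SMILES information: it follows the first whitespace.
    tokens = rsmi.split()
    if not tokens:
        return None
    parts = tokens[0].split(">")
    if len(parts) != 3:
        return None  # not of the form reactants>reagents>products
    reactants, reagents, products = parts
    # Move all reagents to the reactants.
    reactants = ".".join(part for part in (reactants, reagents) if part)
    # Remove reactions without any reactants or products.
    if not reactants or not products:
        return None
    # Remove the existing atom-mapping labels.
    return ATOM_MAP.sub("", f"{reactants}>>{products}")


if __name__ == "__main__":
    print(clean_reaction_smiles("[CH3:1][OH:2].Cl>O>[CH3:1][Cl:3] |f:0.1|"))
```

A toolkit-based implementation would additionally parse every molecule, which is what allows unsanitizable molecules to be detected and atoms to be counted.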
The ord_data_mapped.csv file has the following columns:
ID - unique ID from the original database
Dataset - the name of the dataset from which this reaction is taken
Date - the date of the experiment as given in the database
ReactionSmiles - the original reaction SMILES
Yield - the yield of the first product of the first outcome, if provided
ReactionSmilesClean - the reaction SMILES after cleaning
BadMolecules - molecules not sanitizable by RDKit
ReactantSize - number of atoms in reactants
ProductSize - number of atoms in products
mapped_rxn - the mapped reaction SMILES
confidence - the confidence of the mapping as provided by rxnmapper
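Since the artifacts are tab-separated, they can be read with the standard csv module. The sketch below is one possible way to keep only confidently mapped reactions; the function name read_confident_reactions and the 0.9 threshold are arbitrary choices for illustration, and only the column names ID, mapped_rxn and confidence are taken from the list above.

```python
import csv


def read_confident_reactions(lines, min_confidence=0.9):
    """Yield (ID, mapped_rxn, confidence) for rows at or above a threshold.

    `lines` is any iterable of text lines, for example an open file handle.
    The default threshold is an arbitrary value chosen for illustration.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        try:
            confidence = float(row["confidence"])
        except (TypeError, ValueError):
            continue  # skip rows without a usable confidence value
        if confidence >= min_confidence:
            yield row["ID"], row["mapped_rxn"], confidence
```

With the real file, the function could be used as `with open("ord_data_mapped.csv", newline="") as handle:` followed by a loop over `read_confident_reactions(handle)`.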