USPTO

rxnutils contains two pipelines that together download and prepare the USPTO reaction data so that it can be used for modelling.

Together they form a complete end-to-end pipeline that is designed to be transparent and reproducible.

Pre-requisites

The pipeline is divided into two blocks because the dependencies of the atom-mapper package (rxnmapper) are incompatible with the dependencies of the rxnutils package. Therefore, to use the full pipeline, you need to set up two Python environments.

Install rxnutils according to the instructions in the README file
Install rxnmapper according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper

conda create -n rxnmapper python=3.6 -y
conda activate rxnmapper
conda install -c rdkit rdkit=2020.03.3.0
python -m pip install rxnmapper

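Install Metaflow and rxnutils in the new environment. The second command installs from the current directory, so run it from the root of the rxnutils repository: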

python -m pip install metaflow
python -m pip install --no-deps --ignore-requires-python .

Usage

Create a folder for the USPTO data and, inside that folder, run this command in the rxnutils environment:

conda activate rxn-env
python -m rxnutils.data.uspto.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200

and then, in the rxnmapper environment, run:

conda activate rxnmapper
python -m rxnutils.data.mapping_pipeline run --data-prefix uspto --nbatches 200 --max-workers 8 --max-num-splits 200

The --max-workers flag should be set to the number of CPUs available.

On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
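
The worker count can be derived from the machine instead of being hard-coded. The sketch below only assembles the argument list for the commands shown above; build_command is a hypothetical helper and not part of rxnutils, and os.cpu_count() reports logical CPUs, which may be more than the CPUs actually allotted to your job. It reuses the batch count for --max-num-splits, as the commands above do.

```python
import os
import sys


def build_command(module, nbatches=200, max_workers=None, extra_args=()):
    """Build the argument list for one of the pipeline commands shown above.

    Hypothetical helper for illustration; it is not part of rxnutils.
    """
    if max_workers is None:
        # os.cpu_count() can return None, so fall back to a single worker
        max_workers = os.cpu_count() or 1
    return [
        sys.executable, "-m", module, "run",
        *extra_args,
        "--nbatches", str(nbatches),
        "--max-workers", str(max_workers),
        # the commands above use the same value for both flags
        "--max-num-splits", str(nbatches),
    ]


if __name__ == "__main__":
    cmd = build_command("rxnutils.data.uspto.preparation_pipeline")
    print(" ".join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

For the mapping step, the same helper could be called with "rxnutils.data.mapping_pipeline" and extra_args=("--data-prefix", "uspto"). Because sys.executable is the current interpreter, the script has to be started from the conda environment that matches the pipeline being run.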

Artifacts

The pipelines create a number of tab-separated CSV files:

1976_Sep2016_USPTOgrants_smiles.rsmi and 2001_Sep2016_USPTOapplications_smiles.rsmi are the original USPTO data files downloaded from Figshare

uspto_data.csv is the combined USPTO data, with selected columns and a unique ID for each reaction

uspto_data_cleaned.csv is the cleaned and filtered data

uspto_data_mapped.csv is the atom-mapped, modelling-ready data
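
Because these files are tab-separated despite the .csv extension, the delimiter has to be given explicitly when reading them. A minimal sketch using only the standard library follows; the function name iter_reactions is illustrative and not part of rxnutils. It yields rows lazily, since the full data set may be too large to hold comfortably in memory as a list of dicts.

```python
import csv


def iter_reactions(path):
    """Yield one dict per reaction from a tab-separated pipeline artifact."""
    with open(path, newline="", encoding="utf-8") as handle:
        # the artifacts use tabs, not commas, as the field separator
        yield from csv.DictReader(handle, delimiter="\t")
```

For example, next(iter_reactions("uspto_data_mapped.csv")) would return the first reaction as a dict keyed by column name, with all values as strings.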

The cleaning is done so that the reactions can be atom-mapped, and it performs the following tasks:

Ignore extended SMILES information in the SMILES strings
Remove molecules not sanitizable by RDKit
Remove reactions without any reactants or products
Move all reagents to reactants
Remove the existing atom-mapping
Remove reactions with more than 200 atoms when summing reactants and products

(The last step is a requirement of rxnmapper, which was trained with a maximum token size roughly corresponding to 200 atoms.)
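
Two of the steps above are pure string manipulation and can be illustrated without RDKit. The sketch below is not the rxnutils implementation; the function names are hypothetical, and it assumes that the extended SMILES block is separated from the reaction SMILES by whitespace and that ">" occurs only as the reaction separator. The RDKit-dependent steps (sanitization, removal of atom-mapping, atom counting) are left out.

```python
def strip_extended_smiles(reaction_smiles):
    """Drop anything after the first whitespace, e.g. a trailing ' |f:0.1|' block."""
    parts = reaction_smiles.strip().split(maxsplit=1)
    return parts[0] if parts else ""


def move_reagents_to_reactants(reaction_smiles):
    """Rewrite 'reactants>reagents>products' as 'reactants.reagents>>products'."""
    # raises ValueError if the string does not have exactly three '>'-separated parts
    reactants, reagents, products = reaction_smiles.split(">")
    left = ".".join(part for part in (reactants, reagents) if part)
    return f"{left}>>{products}"


if __name__ == "__main__":
    raw = "CCO.CC(=O)O>[H+]>CC(=O)OCC |f:0.1|"
    print(move_reagents_to_reactants(strip_extended_smiles(raw)))
    # prints CCO.CC(=O)O.[H+]>>CC(=O)OCC
```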

The uspto_data_mapped.csv file will have the following columns:

ID - unique ID created by concatenating the patent number, paragraph and row index in the original data file

Year - the year of the patent filing

ReactionSmiles - the original reaction SMILES

ReactionSmilesClean - the reaction SMILES after cleaning

BadMolecules - molecules not sanitizable by RDKit

ReactantSize - number of atoms in reactants

ProductSize - number of atoms in products

mapped_rxn - the mapped reaction SMILES

confidence - the confidence of the mapping as provided by rxnmapper
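
The confidence column can be used to discard poorly mapped reactions before modelling. The sketch below assumes rows is any iterable of dicts keyed by the column names above, for example from csv.DictReader with delimiter="\t". The function name keep_confident and the default threshold of 0.5 are arbitrary choices for illustration, not recommendations from rxnutils or rxnmapper.

```python
def keep_confident(rows, min_confidence=0.5):
    """Yield rows whose 'confidence' value is at least min_confidence."""
    for row in rows:
        try:
            confidence = float(row["confidence"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric confidence
        if confidence >= min_confidence:
            yield row
```

A suitable threshold depends on the downstream task, so it is worth inspecting the distribution of confidence values in your own data before fixing one.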
