USPTO
=====

``rxnutils`` contains two pipelines that together download and prepare the USPTO reaction data so that it can be used for modelling.

Together they form a complete end-to-end workflow that is designed to be transparent and reproducible.

Pre-requisites
--------------

The pipeline is divided into two blocks because the dependencies of the atom-mapper package (``rxnmapper``) are incompatible with
the dependencies of the ``rxnutils`` package. Therefore, to use the full pipeline, you need to set up two Python environments.

1. Install ``rxnutils`` according to the instructions in the `README`-file

2. Install ``rxnmapper`` according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper


.. code-block::
            
    conda create -n rxnmapper python=3.6 -y
    conda activate rxnmapper
    conda install -c rdkit rdkit=2020.03.3.0
    python -m pip install rxnmapper


3. Install ``Metaflow`` and ``rxnutils`` in the new ``rxnmapper`` environment (run the second command from the root of the ``rxnutils`` repository)


.. code-block::

    python -m pip install metaflow
    python -m pip install --no-deps --ignore-requires-python . 


Usage
-----

Create a folder for the USPTO data and, inside that folder, run this command in the ``rxnutils`` environment:


.. code-block::

    conda activate rxn-env
    python -m rxnutils.data.uspto.preparation_pipeline run --nbatches 200  --max-workers 8 --max-num-splits 200


Then, in the ``rxnmapper`` environment, run:


.. code-block::

    conda activate rxnmapper
    python -m rxnutils.data.mapping_pipeline run --data-prefix uspto --nbatches 200  --max-workers 8 --max-num-splits 200


The ``--max-workers`` flag should be set to the number of CPUs available.

On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
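
As a rough sketch, the two commands above could be assembled from Python, taking the worker count from ``os.cpu_count()``. The helper name ``build_commands`` is hypothetical, and wrapping each call in ``conda run -n <env>`` is an assumption made here for illustration; neither is part of ``rxnutils``. The environment names and flags are the ones used above, and ``--max-num-splits`` is simply set equal to ``--nbatches`` as in the example commands.

.. code-block:: python

    import os


    def build_commands(nbatches=200, max_workers=None):
        """Return the two pipeline commands as argument lists (hypothetical helper)."""
        # Fall back to a single worker if the CPU count cannot be determined.
        workers = max_workers or os.cpu_count() or 1
        options = [
            "--nbatches", str(nbatches),
            "--max-workers", str(workers),
            "--max-num-splits", str(nbatches),
        ]
        # Assumption: each pipeline is started in its own conda environment.
        preparation = [
            "conda", "run", "-n", "rxn-env",
            "python", "-m", "rxnutils.data.uspto.preparation_pipeline",
            "run",
        ] + options
        mapping = [
            "conda", "run", "-n", "rxnmapper",
            "python", "-m", "rxnutils.data.mapping_pipeline",
            "run", "--data-prefix", "uspto",
        ] + options
        return preparation, mapping

Each list could then be passed to ``subprocess.run(command, check=True)`` from inside the data folder, running the preparation command first.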


Artifacts
---------

The pipelines create a number of `tab-separated` CSV files:

    * `1976_Sep2016_USPTOgrants_smiles.rsmi` and `2001_Sep2016_USPTOapplications_smiles.rsmi` are the original USPTO data downloaded from Figshare
    * `uspto_data.csv` is the combined USPTO data, with selected columns and a unique ID for each reaction
    * `uspto_data_cleaned.csv` is the cleaned and filtered data
    * `uspto_data_mapped.csv` is the atom-mapped, modelling-ready data


The cleaning is done so that the reactions can be atom-mapped. It performs the following tasks:

    * Ignore extended SMILES information in the SMILES strings
    * Remove molecules not sanitizable by RDKit
    * Remove reactions without any reactants or products 
    * Move all reagents to reactants
    * Remove the existing atom-mapping
    * Remove reactions with more than 200 atoms when summing reactants and products 

(The last step is required because ``rxnmapper`` was trained with a maximum token count roughly corresponding to 200 atoms.)
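
To illustrate two of the steps above (ignoring extended SMILES information and moving reagents to reactants), here is a minimal sketch using only string operations. It is not the ``rxnutils`` implementation; the function name ``move_reagents_to_reactants`` is hypothetical, and the sketch assumes a well-formed ``reactants>reagents>products`` string. RDKit sanitization, atom-map removal and the size filter are not covered.

.. code-block:: python

    def move_reagents_to_reactants(reaction_smiles):
        """Drop extended SMILES info and merge reagents into reactants."""
        # Extended SMILES information follows the first whitespace, e.g. " |f:0.1|".
        core = reaction_smiles.strip().split(maxsplit=1)[0]
        reactants, reagents, products = core.split(">")
        # Keep only non-empty parts so that no stray "." is introduced.
        left = ".".join(part for part in (reactants, reagents) if part)
        return f"{left}>>{products}"

For a reaction written as ``A.B>C>D |f:0.1|``, this sketch returns ``A.B.C>>D``.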


The ``uspto_data_mapped.csv`` file has the following columns:

    * ID - unique ID created by concatenating the patent number, paragraph and row index in the original data file
    * Year - the year of the patent filing
    * ReactionSmiles - the original reaction SMILES
    * ReactionSmilesClean - the reaction SMILES after cleaning
    * BadMolecules - molecules not sanitizable by RDKit
    * ReactantSize - number of atoms in reactants
    * ProductSize - number of atoms in products
    * mapped_rxn - the mapped reaction SMILES
    * confidence - the confidence of the mapping as provided by ``rxnmapper``
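
As a small sketch of how the mapped file could be read, the snippet below uses only the standard library and keeps rows whose ``confidence`` is above a threshold. The function name ``load_mapped_reactions`` and the threshold are hypothetical choices for illustration; the column names are the ones listed above.

.. code-block:: python

    import csv


    def load_mapped_reactions(path, min_confidence=0.0):
        """Yield rows of the mapped file with confidence >= min_confidence."""
        with open(path, newline="", encoding="utf-8") as handle:
            # The pipeline output is tab-separated, despite the .csv extension.
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                if float(row["confidence"]) >= min_confidence:
                    yield row

For example, ``list(load_mapped_reactions("uspto_data_mapped.csv", min_confidence=0.9))`` would collect the rows mapped with a confidence of at least 0.9; each row is a dictionary keyed by column name, so the mapped reaction is available as ``row["mapped_rxn"]``.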