Pipeline

rxnutils provides a simple pipeline for performing routine tasks on reaction SMILES and templates in a CSV file.

The pipeline works on tab-separated CSV files (TSV files).
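
For illustration, a minimal input file could look like the sketch below, with columns separated by tabs. The ID column and the reaction SMILES are made up for this example; the ReactionSmiles column name matches the YAML file further down.

ID	ReactionSmiles
1	[CH3:1][OH:2].OC(=O)[CH3:3]>>[CH3:1][O:2]C(=O)[CH3:3]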

Usage

To exemplify the pipeline capabilities, we will have a look at the pipeline used to clean the USPTO data.

The input to the pipeline is a simple YAML file that specifies each action to take. The actions are executed sequentially, one after the other, and each action takes a number of input arguments.

This is the YAML file used to clean the USPTO data:

trim_rxn_smiles:
    in_column: ReactionSmiles
    out_column: ReactionSmilesClean
remove_unsanitizable:
    in_column: ReactionSmilesClean
    out_column: ReactionSmilesClean
reagents2reactants:
    in_column: ReactionSmilesClean
    out_column: ReactionSmilesClean
remove_atom_mapping:
    in_column: ReactionSmilesClean
    out_column: ReactionSmilesClean
reactantsize:
    in_column: ReactionSmilesClean
productsize:
    in_column: ReactionSmilesClean
query_dataframe1:
    query: "ReactantSize>0"
query_dataframe2:
    query: "ProductSize>0"
query_dataframe3:
    query: "ReactantSize+ProductSize<200"

The first action is called trim_rxn_smiles and is given two arguments: in_column, which specifies the column to use as input, and out_column, which specifies the column to use as output.

The following actions, remove_unsanitizable, reagents2reactants, remove_atom_mapping, reactantsize and productsize, work the same way, but may use other columns for their output.

The last three actions are actually the same action executed with different arguments; they therefore have to be suffixed with 1, 2 and 3. The query_dataframe action takes a query argument and removes the rows that do not match the query.

If we save this to clean_pipeline.yml, and given that we have a tab-separated file with USPTO data called uspto_data.csv, we can run the following command:

python -m rxnutils.pipeline.runner --pipeline clean_pipeline.yml --data uspto_data.csv --output uspto_cleaned.csv

or, alternatively, we can run it from Python like this:

from rxnutils.pipeline.runner import main as validation_runner

validation_runner(
    [
        "--pipeline",
        "clean_pipeline.yml",
        "--data",
        "uspto_data.csv",
        "--output",
        "uspto_cleaned.csv",
    ]
)

Actions

To find out what actions are available, you can type:

python -m rxnutils.pipeline.runner --list
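
The same listing should also be obtainable from Python, by passing the flag to the runner function used above:

from rxnutils.pipeline.runner import main as validation_runner

validation_runner(["--list"])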

Development

New actions can easily be added to the pipeline framework. All of the actions are implemented in one of four modules:

  • rxnutils.pipeline.actions.dataframe_mod - actions that modify the dataframe, e.g., removing rows or columns

  • rxnutils.pipeline.actions.reaction_mod - actions that modify reaction SMILES

  • rxnutils.pipeline.actions.dataframe_props - actions that compute properties from reaction SMILES

  • rxnutils.pipeline.actions.templates - actions that process reaction templates

To exemplify, let’s have a look at the productsize action:

@action
@dataclass
class ProductSize:
    """Action for counting product size"""

    pretty_name: ClassVar[str] = "productsize"
    in_column: str
    out_column: str = "ProductSize"

    def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
        smiles_col = global_apply(data, self._row_action, axis=1)
        return data.assign(**{self.out_column: smiles_col})

    def __str__(self) -> str:
        return f"{self.pretty_name} (number of heavy atoms in product)"

    def _row_action(self, row: pd.Series) -> int:
        _, _, products = row[self.in_column].split(">")
        products_mol = Chem.MolFromSmiles(products)

        if products_mol:
            product_atom_count = products_mol.GetNumHeavyAtoms()
        else:
            product_atom_count = 0

        return product_atom_count

The action is defined as a class, ProductSize, with two class decorators. The first, @action, registers the action in a global action list, and the second, @dataclass, is the dataclass decorator from the standard library. The pretty_name class variable is used to identify the action in the pipeline; this is the name you specify in the YAML file. The other two fields, in_column and out_column, are the arguments you can specify in the YAML file when executing the action; they can have default values in case they don’t need to be specified in the YAML file.

When the action is executed by the pipeline, the __call__ method is invoked with the current Pandas dataframe as the only argument. This method should return the modified dataframe.
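
This also means that an action can be applied to a dataframe directly, outside the pipeline. Below is a small sketch, assuming that ProductSize can be imported from the rxnutils.pipeline.actions.dataframe_props module listed above; the dataframe and the reaction SMILES are made up for the example.

import pandas as pd

from rxnutils.pipeline.actions.dataframe_props import ProductSize

# Toy dataframe with a single atom-mapped reaction SMILES (illustrative data)
data = pd.DataFrame(
    {"ReactionSmilesClean": ["[CH3:1][OH:2].OC(C)=O>>[CH3:1][O:2]C(C)=O"]}
)

# Instantiate the action with the same arguments as in the YAML file
count_products = ProductSize(in_column="ReactionSmilesClean")

# Invoking the action returns a new dataframe with the ProductSize column added
data = count_products(data)
print(data["ProductSize"])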

Lastly, it is good practice to implement a __str__ method, which is used by the pipeline to print useful information about the action being executed.
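
Putting it all together, a new action could look something like the sketch below: a hypothetical action that counts the number of reactant molecules. The import path for action and global_apply is an assumption; check how the existing action modules import them.

from dataclasses import dataclass
from typing import ClassVar

import pandas as pd

from rxnutils.pipeline.base import action, global_apply  # assumed import path


@action
@dataclass
class ReactantCount:
    """Hypothetical action for counting the number of reactant molecules"""

    pretty_name: ClassVar[str] = "reactantcount"
    in_column: str
    out_column: str = "ReactantCount"

    def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
        counts = global_apply(data, self._row_action, axis=1)
        return data.assign(**{self.out_column: counts})

    def __str__(self) -> str:
        return f"{self.pretty_name} (number of reactant molecules)"

    def _row_action(self, row: pd.Series) -> int:
        reactants, _, _ = row[self.in_column].split(">")
        return len(reactants.split(".")) if reactants else 0

If this class is placed in one of the action modules above, so that it is imported together with the other actions, the @action decorator registers it and it can then be referred to as reactantcount in a pipeline YAML file.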