Docking workflow#
This is a quick guide to setting up a simple linear workflow for docking small molecules.
Imports#
We will first import generally useful stuff - maize uses pathlib whenever file system paths are involved:
[1]:
from pathlib import Path
The core functionality of maize is contained in the Workflow class.
[2]:
from maize.core.workflow import Workflow
We will also need some steps that allow us to load and save any kind of data:
[3]:
from maize.steps.io import LoadData, LogResult, Return, Void
Domain-specific steps are contained in the maize-contrib namespace package. We will need a docking step (AutoDockGPU) and a step to generate small molecule isomers / conformers (Gypsum).
[4]:
from maize.steps.mai.docking.adv import AutoDockGPU
from maize.steps.mai.molecule import Gypsum
from maize.utilities.chem import IsomerCollection
Preparation#
Configuration#
Most available workflow nodes in Maize require some system-specific configuration, such as the location of external software packages, names of modules, or Python packages. This configuration takes place in a TOML file; by default Maize looks in $XDG_CONFIG_HOME/maize.toml (~/.config/maize.toml) for a file like this:
[5]:
!cat docking-example-config.toml
[autodockgpu]
python = "/projects/mai/users/${USER}_thomas/opt/miniconda3/envs/maize-dev/bin/python"
modules = ["CUDA", "GCC"]
commands.autodock_gpu = "/projects/mai/users/${USER}_thomas/src/AutoDock-GPU/bin/autodock_gpu_64wi"
[gypsum]
scripts.gypsum.interpreter = "/projects/mai/users/${USER}_thomas/opt/miniconda3/envs/gypsum/bin/python"
scripts.gypsum.location = "/projects/mai/users/${USER}_thomas/src/gypsum_dl/run_gypsum_dl.py"
Here, we have configured the AutoDockGPU node by specifying the Python interpreter to be used for its execution, the modules that need to be loaded beforehand, and the precise location of the autodock_gpu command. We also need this information for Gypsum (to embed molecules), but since it’s a script, we provide the script location and the interpreter to use separately.
How to provide this configuration is described in the documentation for each node. Nodes will often have a required_callables class attribute specifying the name of the callable to be provided in the configuration. In the above example, AutoDockGPU requires a callable named autodock_gpu, the path to which we can provide using commands.autodock_gpu. In other cases the executable might already be in your $PATH (or loaded through the module system), in which case no additional configuration is required.
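If you’re ever unsure which callables a particular node needs, one option is to inspect its required_callables attribute directly in the notebook (AutoDockGPU was imported above; the exact contents of the attribute depend on the node and your maize version):

print(AutoDockGPU.required_callables)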
Grid#
We also need to specify the target grid file. AutoDockGPU requires an .fld file with references to a few other files to operate correctly.
[6]:
grid = Path("../maize/steps/mai/docking/data/1stp/1stp_protein.maps.fld")
Workflow definition#
We can now create our workflow instance, give it a name, and optionally specify a logging level. Setting cleanup_temp=False allows us to keep all files related to the execution instead of just using temporary directories. We also load our custom configuration because it’s in a non-standard location.
[7]:
flow = Workflow(name="dock", level="info", cleanup_temp=False)
flow.config.update(Path("docking-example-config.toml"))
Adding nodes#
Workflow nodes can be added using the add method. We will need a node to load data and inject it into the workflow (LoadData), the molecule preparation and docking steps (Gypsum and AutoDockGPU), and steps to handle the output (Void and Return).
Note that steps accepting any kind of generic data should be parameterised with the concrete type. This is the case for LoadData, which we will pass a list of SMILES codes (list[str]), and also for Return. Doing this allows all connections to be type-checked statically, catching many workflow construction errors.
[8]:
load = flow.add(LoadData[list[str]])
embe = flow.add(Gypsum)
dock = flow.add(AutoDockGPU)
void = flow.add(Void)
retu = flow.add(Return[list[IsomerCollection]])
Configuration#
We can now set the required parameters. LoadData requires setting the data parameter with the data we want to send on to the next node. We also specify that we want a maximum of 2 variants for the isomer generation / embedding step, and finally set the docking grid:
[9]:
load.data.set(["Nc1ccc(ccc1N)C", "Nc1ccc(cc1N)C"])
embe.n_variants.set(2)
dock.grid_file.set(grid)
Connections#
All that is left to do now is to connect all nodes together:
[10]:
flow.connect(load.out, embe.inp)
flow.connect(embe.out, dock.inp)
flow.connect(dock.out, retu.inp)
flow.connect(dock.out_scores, void.inp)
We can check if everything is ready to run using the check() method. It will ensure all nodes in the graph are connected properly (with correct types) and also go through each node’s software requirements, making sure it’s all available:
[11]:
flow.check()
If you installed maize with graphviz, you can visualize the workflow in your notebook:
[12]:
flow.visualize()
[12]: (workflow graph visualization)
Running#
We can now run our graph!
[13]:
flow.execute()
2023-09-05 10:46:50,129 | INFO | dock |
___ ___ ___ ___
/\__\ /\ \ ___ /\ \ /\ \
/::| | /::\ \ /\ \ \:\ \ /::\ \
/:|:| | /:/\:\ \ \:\ \ \:\ \ /:/\:\ \
/:/|:|__|__ /::\~\:\ \ /::\__\ \:\ \ /::\~\:\ \
/:/ |::::\__\ /:/\:\ \:\__\ __/:/\/__/ _______\:\__\ /:/\:\ \:\__\
\/__/~~/:/ / \/__\:\/:/ / /\/:/ / \::::::::/__/ \:\~\:\ \/__/
/:/ / \::/ / \::/__/ \:\~~\~~ \:\ \:\__\
/:/ / /:/ / \:\__\ \:\ \ \:\ \/__/
/:/ / /:/ / \/__/ \:\__\ \:\__\
\/__/ \/__/ \/__/ \/__/
2023-09-05 10:46:50,130 | INFO | dock | Starting Maize version 0.4.1 (c) AstraZeneca 2023
2023-09-05 10:46:51,326 | INFO | dock | Node 'loaddata' finished (1/5)
2023-09-05 10:46:52,144 | INFO | dock | Node 'gypsum' finished (2/5)
2023-09-05 10:46:52,294 | WARNING | autodockgpu | Docking isomer 'UPIWLIWRZMYFKW-HTQZYQBONA-N' failed
2023-09-05 10:46:52,301 | INFO | autodockgpu | Parsed isomer 'UPIWLIWRZMYFKW-ITHFDTRPNA-P', score -3.27
2023-09-05 10:46:52,341 | INFO | autodockgpu | Parsed isomer 'DGRGLKZMKWPMOH-UHFFFAOYNA-N', score -4.43
2023-09-05 10:46:52,377 | INFO | dock | Node 'return' finished (3/5)
2023-09-05 10:46:53,383 | INFO | dock | Node 'autodockgpu' finished (4/5)
2023-09-05 10:46:53,853 | INFO | dock | Node 'void' finished (5/5)
2023-09-05 10:46:54,356 | INFO | dock | Execution completed :), total runtime: 0:00:03.674147
4 nodes completed successfully
1 nodes stopped due to closing ports
0 nodes failed
0:00:09.299583 total walltime
0:00:04.741414 spent waiting for resources or other nodes
Results#
If all went well you should see a summary of the workflow run in the log above. The docking scores were sent to the Void node and discarded (although AutoDockGPU logged each parsed score), but we still have access to the docked conformations through the Return node. To access this data, just call get():
[14]:
mols = retu.get()
mols
[14]:
[IsomerCollection('CC1CC[C@H]([NH3+])[C@@H]([NH3+])CC1', n_isomers=2),
IsomerCollection('Cc1ccc(N)c(N)c1', n_isomers=1, best_score=-4.43)]
Here, IsomerCollection is a simple container around Isomer objects, which in turn are pythonic wrappers around RDKit molecules. These can be accessed using the _molecule attribute:
[15]:
mols[0].molecules[0]._molecule
[15]:
UniqueID  | 1
SMILES    | CC1CC[C@@H](N)[C@H](N)CC1
Energy    | 35.402799745012786
Genealogy | CC1CC[C@@H](N)[C@H](N)CC1 (chirality)
          | CC1CC[C@@H](N)[C@H](N)CC1 (3D coordinates assigned)
          | CC1CC[C@@H](N)[C@H](N)CC1 (nonaromatic ring conformer: 35.402799745012786 kcal/mol)
energy    | [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
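Since Isomer objects wrap RDKit molecules, standard RDKit functions apply directly. As a small illustrative sketch (using only the molecules and _molecule attributes shown above), we could print the canonical SMILES of every docked isomer:

from rdkit import Chem

# Loop over all isomer collections and their isomers, converting the
# underlying RDKit molecule back to a canonical SMILES string
for collection in mols:
    for isomer in collection.molecules:
        print(Chem.MolToSmiles(isomer._molecule))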
Adding a custom node#
In the above workflow, we just discarded the docking scores of the molecules and instead used only the docked conformations. Let’s create our own node that will log the docking scores. We will first import some required classes to allow us to define our node:
[16]:
import numpy as np
from numpy.typing import NDArray
from maize.core.interface import Input
from maize.core.node import Node
Defining nodes#
We define our custom node by creating a new class inheriting from Node. Every workflow node must have at least one port (Input and / or Output) and a run method describing the logic. Ports must also declare the correct type to minimize errors in graph construction.
In this node, we just receive the scores (a blocking call) and then print them out using the built-in logger.
[17]:
class ScoreLog(Node):
    """Logs scores in the form of NDArrays"""

    inp: Input[NDArray[np.float32]] = Input()

    def run(self) -> None:
        scores = self.inp.receive()
        self.logger.info("Received scores: %s", scores)
Workflow#
We can now build up our workflow as normal, substituting our ScoreLog node for the Void node:
[18]:
flow = Workflow(name="dock", level="info", cleanup_temp=False)
flow.config.update(Path("docking-example-config.toml"))
[19]:
load = flow.add(LoadData[list[str]])
embe = flow.add(Gypsum)
dock = flow.add(AutoDockGPU)
save = flow.add(ScoreLog)
retu = flow.add(Return[list[IsomerCollection]])
Our new node has no parameters, so everything stays the same here:
[20]:
load.data.set(["Nc1ccc(ccc1N)C", "Nc1ccc(cc1N)C"])
embe.n_variants.set(2)
dock.grid_file.set(grid)
We again connect everything together:
[21]:
flow.connect(load.out, embe.inp)
flow.connect(embe.out, dock.inp)
flow.connect(dock.out, retu.inp)
flow.connect(dock.out_scores, save.inp)
And check we didn’t make a mistake:
[22]:
flow.check()
Running#
Let’s run the new workflow; we should see the scores being logged by our own node:
[23]:
flow.execute()
2023-09-05 10:46:55,671 | INFO | dock |
___ ___ ___ ___
/\__\ /\ \ ___ /\ \ /\ \
/::| | /::\ \ /\ \ \:\ \ /::\ \
/:|:| | /:/\:\ \ \:\ \ \:\ \ /:/\:\ \
/:/|:|__|__ /::\~\:\ \ /::\__\ \:\ \ /::\~\:\ \
/:/ |::::\__\ /:/\:\ \:\__\ __/:/\/__/ _______\:\__\ /:/\:\ \:\__\
\/__/~~/:/ / \/__\:\/:/ / /\/:/ / \::::::::/__/ \:\~\:\ \/__/
/:/ / \::/ / \::/__/ \:\~~\~~ \:\ \:\__\
/:/ / /:/ / \:\__\ \:\ \ \:\ \/__/
/:/ / /:/ / \/__/ \:\__\ \:\__\
\/__/ \/__/ \/__/ \/__/
2023-09-05 10:46:55,673 | INFO | dock | Starting Maize version 0.4.1 (c) AstraZeneca 2023
2023-09-05 10:46:56,899 | INFO | dock | Node 'loaddata' finished (1/5)
2023-09-05 10:46:57,717 | INFO | dock | Node 'gypsum' finished (2/5)
2023-09-05 10:46:58,259 | INFO | autodockgpu | Parsed isomer 'UPIWLIWRZMYFKW-NNELSRSVNA-P', score -2.7
2023-09-05 10:46:58,302 | INFO | autodockgpu | Parsed isomer 'UPIWLIWRZMYFKW-ITHFDTRPNA-P', score -3.45
2023-09-05 10:46:58,328 | INFO | autodockgpu | Parsed isomer 'DGRGLKZMKWPMOH-UHFFFAOYNA-N', score -4.4
2023-09-05 10:46:58,355 | INFO | dock | Node 'return' finished (3/5)
2023-09-05 10:46:58,887 | INFO | dock | Node 'scorelog' finished (4/5)
2023-09-05 10:46:58,883 | INFO | scorelog | Received scores: [-3.45 -4.4 ]
2023-09-05 10:46:59,384 | INFO | dock | Node 'autodockgpu' finished (5/5)
2023-09-05 10:46:59,887 | INFO | dock | Execution completed :), total runtime: 0:00:03.664648
5 nodes completed successfully
0 nodes stopped due to closing ports
0 nodes failed
0:00:10.142232 total walltime
0:00:08.158412 spent waiting for resources or other nodes
High-throughput docking#
The workflow above is a classic example of a directed acyclic graph (DAG). Each step is effectively run sequentially, which is fine for a small number of molecules but can become a problem for very large libraries, as we might spend a lot of time embedding molecules while we could already be docking some of them.
In Maize we can sidestep this problem by batching our data and sending it through the same workflow repeatedly. To do this we will need two generic helper nodes, Batch and Combine:
[24]:
import numpy as np
from numpy.typing import NDArray
from maize.steps.plumbing import Batch, Combine
from maize.steps.io import Void
Workflow#
We can now build our workflow again, starting just like before:
[25]:
flow = Workflow(name="dock-ht", level="info", cleanup_temp=False)
flow.config.update(Path("docking-example-config.toml"))
We again add all required nodes just like before, as well as the Batch and Combine nodes. The former will receive a list of items and split it into smaller lists, while the latter performs the opposite operation (a short plain-Python sketch of this splitting follows the next cell). This will allow us to send smaller batches of SMILES to be docked, so molecules can be embedded on the CPU while the previous batch is already docking on the GPU. To make this possible, both Gypsum and AutoDockGPU need to be able to continuously accept data; we can set this option with loop=True. Finally, we will simplify things a bit and just send the docked molecule conformations to a Void node, discarding them (equivalent to /dev/null) and only keeping the scores.
[26]:
load = flow.add(LoadData[list[str]])
batch = flow.add(Batch[str])
embe = flow.add(Gypsum, loop=True)
dock = flow.add(AutoDockGPU, loop=True)
comb = flow.add(Combine[NDArray[np.float32]])
save = flow.add(LogResult)
void = flow.add(Void)
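For intuition, the splitting performed by Batch (and the flattening performed by Combine) is roughly equivalent to the following plain-Python sketch. This is purely illustrative; the exact batching strategy is up to the node implementation:

smiles = ["Nc1ccc(ccc1N)C", "Nc1ccc(cc1N)C", "Nc1cc(F)c(cc1N)C", "Nc1ccc(cc1N)C(F)"]
n_batches = 2

# One possible strategy: contiguous chunks of roughly equal size
size = -(-len(smiles) // n_batches)  # ceiling division
batches = [smiles[i:i + size] for i in range(0, len(smiles), size)]

# The reverse operation, flattening the per-batch results again
flat = [item for batch in batches for item in batch]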
Configuration#
We again configure our workflow as before:
[27]:
load.data.set(["Nc1ccc(ccc1N)C", "Nc1ccc(cc1N)C", "Nc1cc(F)c(cc1N)C", "Nc1ccc(cc1N)C(F)"])
embe.n_variants.set(2)
dock.grid_file.set(grid)
However, we now also need to set the number of batches to use, and we need to set the same value for both Batch and Combine. Instead of setting both individually, we can merge them into a single parameter using combine_parameters, making the setting less error-prone:
[28]:
n_batches = flow.combine_parameters(batch.n_batches, comb.n_batches)
n_batches.set(2)
We again connect everything, but this time take a shortcut by using connect_all:
[29]:
flow.connect_all(
    (load.out, batch.inp),
    (batch.out, embe.inp),
    (embe.out, dock.inp),
    (dock.out, void.inp),
    (dock.out_scores, comb.inp),
    (comb.out, save.inp),
)
We check everything’s okay and look at the graph structure:
[30]:
flow.check()
flow.visualize()
[30]: (workflow graph visualization)
Run#
Let’s run it!
[31]:
flow.execute()
2023-09-05 10:47:01,146 | INFO | dock-ht |
___ ___ ___ ___
/\__\ /\ \ ___ /\ \ /\ \
/::| | /::\ \ /\ \ \:\ \ /::\ \
/:|:| | /:/\:\ \ \:\ \ \:\ \ /:/\:\ \
/:/|:|__|__ /::\~\:\ \ /::\__\ \:\ \ /::\~\:\ \
/:/ |::::\__\ /:/\:\ \:\__\ __/:/\/__/ _______\:\__\ /:/\:\ \:\__\
\/__/~~/:/ / \/__\:\/:/ / /\/:/ / \::::::::/__/ \:\~\:\ \/__/
/:/ / \::/ / \::/__/ \:\~~\~~ \:\ \:\__\
/:/ / /:/ / \:\__\ \:\ \ \:\ \/__/
/:/ / /:/ / \/__/ \:\__\ \:\__\
\/__/ \/__/ \/__/ \/__/
2023-09-05 10:47:01,148 | INFO | dock-ht | Starting Maize version 0.4.1 (c) AstraZeneca 2023
2023-09-05 10:47:01,787 | INFO | batch | Sending batch 0/2
2023-09-05 10:47:02,289 | INFO | dock-ht | Node 'loaddata' finished (1/7)
2023-09-05 10:47:02,290 | INFO | batch | Sending batch 1/2
2023-09-05 10:47:02,793 | INFO | dock-ht | Node 'batch' finished (2/7)
2023-09-05 10:47:03,623 | WARNING | autodockgpu | Docking isomer 'UPIWLIWRZMYFKW-YUMQZZPRNA-N' failed
2023-09-05 10:47:03,627 | WARNING | autodockgpu | Docking isomer 'UPIWLIWRZMYFKW-WHUPJOBBNA-N' failed
2023-09-05 10:47:03,632 | INFO | autodockgpu | Parsed isomer 'DGRGLKZMKWPMOH-UHFFFAOYNA-N', score -4.43
2023-09-05 10:47:04,186 | INFO | combine | Received batch 1/2
2023-09-05 10:47:05,273 | INFO | dock-ht | Node 'gypsum' finished (3/7)
2023-09-05 10:47:05,426 | WARNING | autodockgpu | Docking isomer 'SNHYTHIEXHWXCQ-UHFFFAOYNA-N' failed
2023-09-05 10:47:05,428 | WARNING | autodockgpu | Docking isomer 'BLUAPHUUFWSBEL-UHFFFAOYNA-N' failed
2023-09-05 10:47:05,938 | INFO | combine | Received batch 2/2
2023-09-05 10:47:05,944 | INFO | dock-ht | Node 'logresult' finished (4/7)
2023-09-05 10:47:05,941 | INFO | logresult | Received data: '[nan, -4.43, nan, nan]'
2023-09-05 10:47:06,440 | INFO | dock-ht | Node 'autodockgpu' finished (5/7)
2023-09-05 10:47:06,442 | INFO | dock-ht | Node 'combine' finished (6/7)
2023-09-05 10:47:06,792 | INFO | dock-ht | Node 'void' finished (7/7)
2023-09-05 10:47:07,294 | INFO | dock-ht | Execution completed :), total runtime: 0:00:05.570066
4 nodes completed successfully
3 nodes stopped due to closing ports
0 nodes failed
0:00:23.166020 total walltime
0:00:13.990148 spent waiting for resources or other nodes
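Note that the combined score list above contains nan entries for isomers that failed to dock. If you want to post-process these scores, here is a small numpy sketch for filtering out the failures (the values are copied from the log output purely for illustration):

import numpy as np

scores = np.array([np.nan, -4.43, np.nan, np.nan])
valid = scores[~np.isnan(scores)]  # keep only successful dockings
print(valid)  # [-4.43]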
Parallel docking#
While the above example allowed us to increase our throughput, we can go even further by explicitly parallelizing our workflow. In this case, we want to perform multiple docking runs in parallel, e.g. through a job submission system such as SLURM or just on a machine with multiple GPUs.
We will use the parallel macro as a shortcut for this purpose; it allows us to run multiple copies of a node in parallel and distribute data across them.
[32]:
from maize.utilities.macros import parallel
Workflow#
We again set up our workflow as normal:
[33]:
flow = Workflow(name="dock-parallel", level="info", cleanup_temp=False)
flow.config.update(Path("docking-example-config.toml"))
We can now add our nodes just as in the first example, but we use the magical parallel function to convert our single AutoDockGPU node into a subgraph containing multiple copies of it. Data will be distributed using a RoundRobin node, i.e. one item will be sent to each output at a time, cycling through all available outputs. The outputs will be connected accordingly with a Merge node. If more than one output or input is present, multiple RoundRobin or Merge nodes will be used. Finally, we also need to set loop=True here to make sure the AutoDockGPU nodes can accept multiple rounds of inputs.
[34]:
load = flow.add(LoadData[list[str]])
embe = flow.add(Gypsum, loop=True)
dock = flow.add(parallel(AutoDockGPU, n_branches=2, inputs=("inp",), outputs=("out", "out_scores"), loop=True))
save = flow.add(LogResult)
void = flow.add(Void)
Configuration#
Configuration is done just like before; the embedded molecules will then be distributed over both AutoDockGPU nodes.
[35]:
load.data.set(["Nc1ccc(ccc1N)C", "Nc1ccc(cc1N)C", "Nc1cc(F)c(cc1N)C", "Nc1ccc(cc1N)C(F)"])
embe.n_variants.set(2)
dock.grid_file.set(grid)
We can again connect everything as normal; the internal connections for distribution and merging are taken care of by the parallel macro.
[36]:
flow.connect_all(
    (load.out, embe.inp),
    (embe.out, dock.inp),
    (dock.out, void.inp),
    (dock.out_scores, save.inp),
)
Let’s check we’re good to go, and also visualize what is going on inside parallel!
[37]:
flow.check()
flow.visualize()
[37]: (workflow graph visualization)
Let’s go! Because my system only has a single GPU, we won’t actually see any speed-up (execution is blocked if required resources are not available).
[38]:
flow.execute()
2023-09-05 10:47:08,824 | INFO | dock-parallel |
___ ___ ___ ___
/\__\ /\ \ ___ /\ \ /\ \
/::| | /::\ \ /\ \ \:\ \ /::\ \
/:|:| | /:/\:\ \ \:\ \ \:\ \ /:/\:\ \
/:/|:|__|__ /::\~\:\ \ /::\__\ \:\ \ /::\~\:\ \
/:/ |::::\__\ /:/\:\ \:\__\ __/:/\/__/ _______\:\__\ /:/\:\ \:\__\
\/__/~~/:/ / \/__\:\/:/ / /\/:/ / \::::::::/__/ \:\~\:\ \/__/
/:/ / \::/ / \::/__/ \:\~~\~~ \:\ \:\__\
/:/ / /:/ / \:\__\ \:\ \ \:\ \/__/
/:/ / /:/ / \/__/ \:\__\ \:\__\
\/__/ \/__/ \/__/ \/__/
2023-09-05 10:47:08,826 | INFO | dock-parallel | Starting Maize version 0.4.1 (c) AstraZeneca 2023
2023-09-05 10:47:09,829 | INFO | dock-parallel | Node 'loaddata' finished (1/9)
2023-09-05 10:47:11,065 | INFO | dock-parallel | Node 'gypsum' finished (2/9)
2023-09-05 10:47:11,076 | INFO | dock-parallel | Node 'sow-inp' finished (3/9)
2023-09-05 10:47:11,208 | INFO | dock-parallel | Node 'AutoDockGPU-1' finished (4/9)
2023-09-05 10:47:11,834 | INFO | AutoDockGPU-0 | Parsed isomer 'UPIWLIWRZMYFKW-ITHFDTRPNA-P', score -3.36
2023-09-05 10:47:11,875 | INFO | AutoDockGPU-0 | Parsed isomer 'UPIWLIWRZMYFKW-NNELSRSVNA-P', score -2.84
2023-09-05 10:47:11,900 | INFO | AutoDockGPU-0 | Parsed isomer 'DGRGLKZMKWPMOH-UHFFFAOYNA-N', score -4.41
2023-09-05 10:47:11,920 | WARNING | AutoDockGPU-0 | Docking isomer 'SNHYTHIEXHWXCQ-UHFFFAOYNA-N' failed
2023-09-05 10:47:11,921 | WARNING | AutoDockGPU-0 | Docking isomer 'BLUAPHUUFWSBEL-UHFFFAOYNA-N' failed
2023-09-05 10:47:12,962 | INFO | dock-parallel | Node 'AutoDockGPU-0' finished (5/9)
2023-09-05 10:47:13,428 | INFO | dock-parallel | Node 'logresult' finished (6/9)
2023-09-05 10:47:13,423 | INFO | logresult | Received data: '[-3.36 -4.41 nan nan]'
2023-09-05 10:47:13,784 | INFO | dock-parallel | Node 'reap-out' finished (7/9)
2023-09-05 10:47:13,924 | INFO | dock-parallel | Node 'reap-out_scores' finished (8/9)
2023-09-05 10:47:14,084 | INFO | dock-parallel | Node 'void' finished (9/9)
2023-09-05 10:47:14,587 | INFO | dock-parallel | Execution completed :), total runtime: 0:00:05.175187
2 nodes completed successfully
7 nodes stopped due to closing ports
0 nodes failed
0:00:27.012650 total walltime
0:00:10.281256 spent waiting for resources or other nodes