
Model retraining and integration (GPU generally required)

Quite a few modules in ASKCOS ship the code for model retraining in the same repository as the serving code. Over the past few years, we have gradually converged on pytorch/torchserve for training and serving ML models. Where applicable, the training instructions also reside in the repo (typically as the last part of the README) with a somewhat standardized structure. Here we will use the template-relevance model for one-step retrosynthesis as an example. Much of the content below is copied from https://gitlab.com/mlpds_mit/askcosv2/retro/template_relevance; we will walk through the README with additional explanation.

Note for MLPDS members: steps 1)-5) for adding the model archive into the deployment differ between deployment with docker compose and deployment with helm. With the images pre-built with data, you will need to customize some file(s) and map them into the containers at runtime. Please refer to the askcos2_deploy wiki for these steps. The process for retraining and creating the model archive is the same.

Model retraining and benchmarking

# from template-relevance README
Step 1/4: Environment Setup

Follow the instructions in Step 1 in the Serving section to build or pull the GPU docker image. It should have the name `${ASKCOS_REGISTRY}/retro/template_relevance:1.0-gpu`

The first step is generally building a GPU-based Docker image. All subsequent scripts will be run inside the Docker container.

Note: only Docker is fully supported. Support for Singularity is partial, and if you run the image with other engines (e.g., Apptainer, Podman), you will need to craft your own benchmark*.sh script for the pipeline to work. For Docker, it is recommended to add your user to the docker group first, so that only `docker run` or `sh scripts_with_docker_command.sh` is needed, rather than `sudo docker run` or `sudo sh scripts_with_docker_command.sh`. We have seen a lot of trouble with the scripts when they are executed with sudo.
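For reference, the one-time group setup and image pull might look like the sketch below; these are standard Docker administration commands, and it is assumed that `${ASKCOS_REGISTRY}` is set as in the Serving instructions.

# one-time: add your user to the docker group so sudo is not needed
sudo usermod -aG docker $USER
newgrp docker    # or log out and back in for the group change to take effect

# then pull (or build) the GPU image used by the training scripts
docker pull ${ASKCOS_REGISTRY}/retro/template_relevance:1.0-gpu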

# from template-relevance README
Step 2/4: Data Preparation

- Option 1: provide pre-split reaction data
...
- Option 2: provide unsplit reaction data
...

Step 3/4: Path Configuration

- Case 1: if pre-split reaction data is provided

Configure the environment variables in `./scripts/benchmark_in_docker_presplit.sh`, especially the paths, to point to the *absolute* paths of raw files and desired output paths.

# benchmark_in_docker_presplit.sh
...
export DATA_NAME="my_new_reactions"
export TRAIN_FILE=$PWD/new_data/raw_train.csv
...

- Case 2: if unsplit reaction data is provided

Configure the environment variables in `./scripts/benchmark_in_docker_unsplit.sh`, especially the paths, to point to the *absolute* paths of the raw file and desired output paths.

# benchmark_in_docker_unsplit.sh
...
export DATA_NAME="my_new_reactions"
export ALL_REACTION_FILE=$PWD/new_data/all_reactions.csv
...

The second and third steps are generally raw data preparation and changing the path(s) in the training script to point to the input file(s). For the retro and forward models we provide with ASKCOSv2, only a .csv file with a list of reaction SMILES is needed. We leave any filtering, cleaning, or pre-splitting of the reaction data to the user. Depending on the exact model, there may be some minimal processing of the raw data, beyond dropping the entries that cannot be handled.
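For illustration, a minimal raw .csv could look like the sketch below. The column names here are assumptions, and template extraction generally requires atom-mapped reaction SMILES; check the repo's README and sample data for the exact expected schema.

# raw_train.csv (hypothetical; column names are illustrative)
id,rxn_smiles
rxn_0,[CH3:1][C:2](=[O:3])[Cl:4].[OH:5][CH2:6][CH3:7]>>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7]
rxn_1,[O:1]=[CH:2][c:3]1[cH:4][cH:5][cH:6][cH:7][cH:8]1>>[OH:1][CH2:2][c:3]1[cH:4][cH:5][cH:6][cH:7][cH:8]1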

# from template-relevance README
Step 4/4: Training and Benchmarking

Run benchmarking on a machine with GPU using

    sh scripts/benchmark_in_docker_presplit.sh

for pre-split data, or

    sh scripts/benchmark_in_docker_unsplit.sh

for unsplit data. ...

Next, run the training/benchmarking script and work on something else while you wait; training may take hours to days depending on the model used and the data size. The default training parameters typically do not affect performance significantly, especially on larger datasets, and therefore do not have to be adjusted. We leave hyperparameter tuning (e.g., by changing the values in the training script) up to users who know what they are doing.
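Since a run can take this long, it may be convenient to launch the script detached from your terminal and monitor the log instead; this is plain shell usage, not part of the README:

# keep the job alive after a dropped SSH session, and capture the output
nohup sh scripts/benchmark_in_docker_presplit.sh > benchmark.log 2>&1 &
tail -f benchmark.log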

Converting trained model into a servable archive

In a general sense, torchserve serves any model from its model archive, a .mar file (similar to a .zip) containing all the needed scripts and checkpoints. To generate this model archive from a trained model, simply change the paths in the respective archiving script in the repo, and then execute the script.

# from template-relevance README
Change the arguments accordingly in the script before running. It's mostly bookkeeping: replace the data name and/or checkpoint paths; the script should be self-explanatory. Then execute the script with

sh scripts/archive_in_docker.sh

The servable model archive (.mar) will be generated under ./mars.
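Since the .mar format is zip-like, you can sanity-check the generated archive before shipping it to the deployment machine; the file name below is illustrative:

# list the generated archives and peek inside one
ls -lh ./mars
unzip -l ./mars/my_model.mar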

Add the model archive into the deployment

As the last step of model integration, adding the model archive into the deployment requires three parts for backend integration, plus two additional parts for frontend integration:

Backend

  • copying the archive over to the deployment machine
  • adding the model name into the central module config
  • adding the model reference into the serving script/command

Frontend

  • copying the zipped reaction, historian, and template files over to the deployment machine
  • seeding the database with these files

Let's say you now have the .mar file ready on your training machine at

$MY_TRAIN_PATH/retro/template_relevance/mars/my_model.mar

and your deployment on another machine under

$MY_DEPLOY_PATH/ASKCOSv2/askcos2_core

1) Copying the archive over to the deployment machine

scp $MY_TRAIN_MACHINE:$MY_TRAIN_PATH/retro/template_relevance/mars/my_model.mar \
  $MY_DEPLOY_PATH/ASKCOSv2/retro/template_relevance/mars/
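Optionally, verify that the transfer is intact by comparing checksums on the two machines (generic shell; paths as above):

# on the training machine
sha256sum $MY_TRAIN_PATH/retro/template_relevance/mars/my_model.mar
# on the deployment machine
sha256sum $MY_DEPLOY_PATH/ASKCOSv2/retro/template_relevance/mars/my_model.mar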

2) Adding the model name into the central module config

In $MY_DEPLOY_MACHINE:$MY_DEPLOY_PATH/ASKCOSv2/askcos2_core/configs/your_module_config.py (module_config_full.py by default), add your model name to the "available_model_names" list under "retro_template_relevance":

# in your_module_config.py
...
"retro_template_relevance": {
  ...
  "deployment": {
    ...
    "available_model_names": [
      "bkms_metabolic",
      "my_model",                          <================== this is your new model name
      "pistachio",
      "pistachio_ringbreaker",
      "reaxys",
      "reaxys_biocatalysis"
    ]
  }
},

3) Adding the model reference into the serving script/command

In $MY_DEPLOY_MACHINE:$MY_DEPLOY_PATH/ASKCOSv2/retro/template_relevance/scripts/serve_{cpu,gpu}_in_docker.sh, add your model under --models

# in serve_cpu_in_docker.sh and serve_gpu_in_docker.sh
...
docker run -d --rm \
  ...
  torchserve \
  ...
  --models \
  bkms_metabolic=bkms_metabolic.mar \
  my_model=my_model.mar \                 <================== this is your new model archive
  pistachio=pistachio.mar \
  pistachio_ringbreaker=pistachio_ringbreaker.mar \
  reaxys=reaxys.mar \
  reaxys_biocatalysis=reaxys_biocatalysis.mar \
  ...

Steps 1)-3) suffice if only backend integration is needed. Simply update the deployment from askcos2_core:

~/ASKCOSv2/askcos2_core$ make stop
~/ASKCOSv2/askcos2_core$ make update
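Your new model can now be queried via the backend APIs. As a quick smoke test, something along the lines below should work; the port, route, and payload are assumptions based on a typical ASKCOSv2 gateway setup, so verify them against your deployment's API docs before relying on them:

# hypothetical smoke test; confirm port/route/payload against your gateway's API docs
curl -X POST "http://localhost:9100/api/retro/template_relevance/call-sync" \
  -H "Content-Type: application/json" \
  -d '{"model_name": "my_model", "smiles": ["CC(=O)OCC"]}'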

For frontend integration, the two extra steps below are needed before updating the deployment.

4) Copying the zipped reaction, historian, and template files over to the deployment machine

mkdir -p $MY_DEPLOY_PATH/ASKCOSv2/tmp
scp $MY_TRAIN_MACHINE:$MY_TRAIN_PATH/retro/template_relevance/data/$DATA_NAME/processed/reactions.$DATA_NAME.json.gz \
  $MY_DEPLOY_PATH/ASKCOSv2/tmp/
scp $MY_TRAIN_MACHINE:$MY_TRAIN_PATH/retro/template_relevance/data/$DATA_NAME/processed/historian.$DATA_NAME.json.gz \
  $MY_DEPLOY_PATH/ASKCOSv2/tmp/
scp $MY_TRAIN_MACHINE:$MY_TRAIN_PATH/retro/template_relevance/data/$DATA_NAME/processed/retro.templates.$DATA_NAME.json.gz \
  $MY_DEPLOY_PATH/ASKCOSv2/tmp/

These three zipped files should have been generated as part of the preprocessing pipeline. The historian file contains the history for all chemicals, namely each chemical's occurrence counts as a reactant and as a product in the training reactions. WARNING: the reaction file contains all information from the training reactions. Once seeded into the database, this information will be available for querying via the API and the frontend. Please ensure that you are only exposing your deployment to people who are supposed to have access to these reaction data. The ASKCOS team takes no responsibility for any potential data leak.

5) Seeding the database with these files

$ cd $MY_DEPLOY_PATH/ASKCOSv2/askcos2_core
$ bash deploy.sh seed-db -x $MY_DEPLOY_PATH/ASKCOSv2/tmp/reactions.$DATA_NAME.json.gz
$ bash deploy.sh seed-db -c $MY_DEPLOY_PATH/ASKCOSv2/tmp/historian.$DATA_NAME.json.gz
$ bash deploy.sh seed-db -r $MY_DEPLOY_PATH/ASKCOSv2/tmp/retro.templates.$DATA_NAME.json.gz

Note that seeding the reactions will force fingerprint pre-computation for all reactions, which might take a while.

After 1)-5), the update command can now be run

~/ASKCOSv2/askcos2_core$ make stop
~/ASKCOSv2/askcos2_core$ make update

And now your new model is ready for use! You should be able to select it from the settings menu. Frontend features that rely on reactions (e.g., linking to reference_url) and on the historian (e.g., displaying precedent occurrences for chemical nodes) should also reflect the newly added data in the database.
