Frequently Asked Questions
Is there a way to search Reaxys and Pistachio at the same time rather than individually, and return the best results from each?
Not at the moment. The code was designed for a single model, so it was more straightforward to add a second model as an independent option. There is definitely value in being able to combine predictions from multiple template sets, and there is some ongoing work to make that possible in the tree builder. However, there are some complexities with merging the results from multiple template prioritizers, primarily around ranking the results relative to each other, because the absolute values of the template scores are not guaranteed to be comparable across models. For instance, the Pistachio model typically returns template scores that are orders of magnitude smaller than those of the Reaxys model.
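For illustration, one common workaround is to merge suggestions by within-model rank rather than raw score, since ranks remain meaningful even when absolute scores are not comparable. The sketch below is purely hypothetical; the list-of-dicts result format and the `score` key are assumptions, not the ASKCOS data model:

```python
# Hypothetical sketch: merging one-step suggestions from two template
# prioritizers by within-model rank, since raw scores are not comparable.

def merge_by_rank(reaxys_results, pistachio_results, top_k=10):
    """Interleave suggestions by each model's own ranking."""
    ranked = []
    for source, results in (("reaxys", reaxys_results),
                            ("pistachio", pistachio_results)):
        # Sort by the model's own score; record the rank, which *is*
        # comparable across models even though the scores are not.
        ordered = sorted(results, key=lambda r: r["score"], reverse=True)
        ranked.extend((rank, source, res) for rank, res in enumerate(ordered))
    ranked.sort(key=lambda item: item[0])  # best rank from each model first
    return ranked[:top_k]
```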
Are there any recommended settings when using the tree builder v2?
The iteration, chemical, and reaction limits are all optional termination settings that serve as more deterministic alternatives to expansion time (since they don't depend on server load/speed). Unfortunately, it's difficult to provide general guidelines, since the expansion behavior can differ significantly depending on the target molecule and other settings like max num. templates and max cum. probability. However, the following context may help (a rough worked estimate follows the list):
- An iteration effectively corresponds to one attempted expansion, regardless of whether it successfully produced a precursor.
- The number of chemicals or reactions comes from successful expansions. There will typically be more reactions than chemicals since different reactions may lead to the same precursors.
- A (very) rough guideline is that 3-4 chemicals can be explored per second (this will vary depending on server specs).
- The chemicals/iterations ratio correlates with the fraction of suggested templates that are applicable to the target and typically ranges from 3% to 30%.
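As a rough worked estimate, the guideline figures above can be combined to translate a desired expansion time into comparable iteration and chemical limits (the throughput and applicability numbers below are just the guideline figures, not measured values):

```python
# Back-of-the-envelope estimate for choosing termination settings,
# assuming ~3.5 chemicals/s throughput and a ~10% chemicals/iterations
# ratio (both taken from the rough guidelines above).
expansion_time = 120                 # seconds of expansion you would allow
chemicals_per_sec = 3.5              # midpoint of the 3-4 chemicals/s guideline
applicability = 0.10                 # assumed chemicals/iterations ratio

est_chemicals = expansion_time * chemicals_per_sec    # ~420 chemicals
est_iterations = est_chemicals / applicability        # ~4200 iterations
print(f"~{est_chemicals:.0f} chemicals, ~{est_iterations:.0f} iterations")
```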
Why can’t I find any pathways to my target?
Retrosynthetic suggestions are proposed, fundamentally, by templates we have generalized from reaction precedents. For some products, e.g., natural products, there may be one very specific reaction step that is highly substrate- or scaffold-specific. Such a reaction template would have very few (if any) similar precedents, and so the program was not able to successfully generalize to this target. When using the relevance template prioritizer, it’s important to note that we do not apply all of the templates in our library (to save on computational time). That means that even if an obscure template would perform well, we may not apply it to the target because it does not appear relevant. You can return to the one-step retrosynthesis module and instead choose to apply all templates to see if this is the case.
Why doesn’t the program find the known literature route to my target? I’ve seen it published before!
Our primary data source is currently Reaxys, so we are somewhat at the mercy of the quality/inclusivity of their database. There are many known reactions that are not in their database, despite their best efforts to maximize its coverage. Moreover, when you search for a pathway, we do not try to find exact literature precedents, even if there is a record for that exact target compound and a full pathway is known. That may sound counterintuitive, but what we are trying to do is generalize known chemistry to new substrates (and eventually, to new chemistry). Reaxys provides the standard search/lookup tool for known targets, so we are purposefully avoiding replicating their functionality. We do report the number of times a chemical appears as a reactant or product in the database, which can serve as an indication that it is worth searching Reaxys or SciFinder for an exact match.
Why do I get an error message when trying to run an expansion?
There are a limited number of workers available to perform computationally-intensive tasks. There is a chance that other users were running expansions at the same time, so no available pool of workers was found. Often, restarting the page and submitting the request again is sufficient to correct the problem. In rare cases, these workers may run into an error state and refuse to accept incoming requests, at which time they will be restarted. If you keep running into issues, please send us a note.
Why are the routes I’m seeing all very similar?
We are thinking about ways to quantify diversity so we can show a broader range of suggestions. In the meantime, you can still hide or blacklist chemicals and reactions in the tree builder tool. When looking at a result, hovering the mouse over a chemical or reaction node will raise a popup box that lets you
- blacklist that reaction or chemical, in which case it will not be used in future pathways
- hide all occurrences of that chemical or reaction, which minimizes all pathways it appears in.
Why are different results obtained if the same compound is searched multiple times?
There is some stochasticity in the search process resulting from parallelization. Since workers may not return results at exactly the same time, the coordinator’s perception of “most promising” may change based on which parts of the expansion have been explored so far. Increasing the expansion time may reduce the variation in the results, as the maximum expansion time may be close to the time required to find the first pathway. If the number of chemicals and reactions explored by the Tree Builder varies dramatically from run to run, this could be due to other users making simultaneous requests.
How does changing the chemical logic impact a tree builder prediction?
Changing the chemical logic alters what the search perceives as an acceptable stopping criterion. Without using the chemical logic options, the search will only resolve pathways that terminate in chemicals that appear in our buyables database. Altering the maximum allowed number of atoms of each element, or modifying the required number of times the chemical has been seen in our training data, allows the algorithm to resolve pathways that terminate in molecules satisfying these criteria.
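As a minimal sketch of this stopping criterion (the function names, argument structure, and exact semantics here are illustrative assumptions, not ASKCOS internals):

```python
from collections import Counter
from rdkit import Chem

def element_counts(smiles):
    """Count heavy atoms of each element in a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return Counter(atom.GetSymbol() for atom in mol.GetAtoms())

def is_terminal(smiles, buyables, history_counts,
                max_atoms=None, min_history=None):
    """Return True if a precursor is an acceptable leaf of a pathway.

    buyables: set of buyable SMILES; history_counts: SMILES -> number of
    appearances in the training data; max_atoms: per-element caps such as
    {"C": 10, "N": 2}; min_history: minimum appearance count.
    """
    if smiles in buyables:                       # default: buyables only
        return True
    if max_atoms is not None:
        counts = element_counts(smiles)
        if all(n <= max_atoms.get(el, 0) for el, n in counts.items()):
            return True                          # within the atom caps
    if min_history is not None and history_counts.get(smiles, 0) >= min_history:
        return True                              # seen often in training data
    return False
```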
What do the different colours signify in the output of the tree planner or the Interactive Path Planner?
Chemicals that appear in our buyables database are highlighted with a green border. Chemicals that do not appear in our buyables database but do appear in our training set of molecules are highlighted with an orange border. Chemicals that do not appear in any of our databases are shown with a red border.
Can the API handle more than one SMILES string at a time?
Currently only one SMILES string can be processed per API request. However, repeated requests could be made programmatically by the user.
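For example, batching can be handled client-side with a simple loop. The endpoint path, payload key, and response format below are assumptions for illustration; consult the API documentation of your deployment for the actual interface:

```python
# Hypothetical sketch: one request per SMILES string, collected in a loop.
import requests

HOST = "https://askcos.example.com"   # placeholder for your ASKCOS instance
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

results = {}
for smi in smiles_list:
    resp = requests.post(f"{HOST}/api/v2/retro/",   # assumed endpoint
                         data={"target": smi})      # assumed payload key
    resp.raise_for_status()
    results[smi] = resp.json()
```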
Will batch submission via the API slow down a user’s web experience?
Each API request requires the same amount of processing as a request via the ASKCOS web interface. There is no QoS enabled to give priority to web vs. API requests; all are queued in the same manner. Asynchronous task queuing has been enabled for the tree builder, which allows a user to submit a request and walk away. When the request has been completed, the results will be saved in that user’s account. This allows the user to submit a request, change some parameters, submit another job, and so on. The user does not need to wait for a request’s result to proceed, so their web experience will not be impacted during a batch submission. If the user is waiting for the result, their web experience may be a little slower than normal, depending on the number of requests in the batch job.
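For instance, the submit-and-walk-away pattern might look like the sketch below; the endpoint paths and response fields are assumptions for illustration:

```python
# Hypothetical sketch: submit several tree builder jobs, then poll for
# completion while the server works asynchronously.
import time
import requests

HOST = "https://askcos.example.com"   # placeholder for your ASKCOS instance

def submit_tree_builder(smiles):
    resp = requests.post(f"{HOST}/api/v2/tree-builder/",  # assumed endpoint
                         data={"smiles": smiles})         # assumed payload key
    resp.raise_for_status()
    return resp.json()["task_id"]                         # assumed field

def wait_for(task_id, poll_interval=10):
    while True:
        resp = requests.get(f"{HOST}/api/v2/celery/task/{task_id}/")  # assumed
        state = resp.json()
        if state.get("complete"):                # assumed completion flag
            return state.get("output")
        time.sleep(poll_interval)                # job keeps running server-side

task_ids = [submit_tree_builder(smi) for smi in ["CCO", "c1ccccc1"]]
trees = [wait_for(tid) for tid in task_ids]
```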
Can the Interactive Reaction Planner’s central graphical results screen/panel be made larger?
Starting with v0.3.1, the panel with the scheme on the left side of the Interactive Path Planner should grow/shrink dynamically to fill the available space in your browser window.
How many active pathways are available during a tree builder query?
The default number of active pathways during MCTS expansion is set to 8 for tree builder v1. Tree builder v2 does not explore pathways in parallel, but is better optimized and performs comparably to v1.
How are the 8 paths chosen from the templates?
The coordinator will choose the 8 most promising partial pathways to expand in parallel and, as results are returned by the worker processes, will dispatch another request based on the new definition of “most promising”. That quantification is largely based on the probability assigned by the template relevance network, but it is also affected by what the coordinator has already explored. The balance between these two represents a direct tradeoff between exploitation and exploration.
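For intuition, this selection criterion is in the spirit of a PUCT-style score from the MCTS literature; the formula below is a generic illustration, not necessarily the exact expression ASKCOS uses:

```python
import math

def puct_score(prior, value, visits, parent_visits, c=1.4):
    """Generic PUCT-style selection score (illustrative only).

    prior: template relevance probability (exploitation signal);
    visits: how many times this branch was already expanded.
    """
    exploit = value                                         # estimated promise
    explore = c * prior * math.sqrt(parent_visits) / (1 + visits)
    return exploit + explore    # heavily-visited branches score lower
```

The division by (1 + visits) is what pushes the coordinator away from branches it has already explored, producing the exploitation/exploration balance described above.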
Is an expansion time of 120-180 seconds appropriate?
That expansion time should be sufficient. To check, it is recommended that you run the target with a large expansion time and the “return as soon as pathway found” box checked. That will indicate whether it finds the pathway quickly (when it does succeed) or if it usually takes the full time.
Why does ASKCOS Forward Predictor only show one product and not other byproducts of the reaction?
The ASKCOS system looks at the major product, that is, the largest species that has undergone a change in connectivity from the reactants. Any byproducts (salts, leaving groups, etc.) are deemed unimportant and are filtered out. Impurities can be predicted using the Impurity Predictor from within the Forward Predictor site.
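As a minimal sketch of the "largest species" part of that logic (connectivity-change checking is omitted here, and the RDKit usage is illustrative, not the actual ASKCOS implementation):

```python
from rdkit import Chem

def major_product(product_smiles):
    """Return the largest species among predicted products, by heavy-atom count."""
    def heavy_atoms(smi):
        mol = Chem.MolFromSmiles(smi)
        return mol.GetNumHeavyAtoms() if mol else 0
    # Split multi-component SMILES ("A.B.C") into individual species first.
    species = [s for smi in product_smiles for s in smi.split(".")]
    return max(species, key=heavy_atoms)

print(major_product(["CC(=O)OCC.[Cl-].[Na+]"]))  # salts filtered -> "CC(=O)OCC"
```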
How does one correctly build a singularity image for the template relevance container? When built and run, a "rdkit is not available" error appears.
Singularity behaves a bit differently from Docker. Singularity seems to inherit environment variables from the user running the container, so you might have to set the path and pythonpath explicitly. These should already be set in the Docker container, but if for some reason they are being overwritten by your host environment variables, the import may fail. Check that the PYTHONPATH and LD_LIBRARY_PATH variables are set as suggested in lines 16 & 17 of https://gitlab.com/mlpds_mit/ASKCOS/template-relevance/-/blob/dev/Dockerfile.gpu
Why have the memory requirements of release 2021.01 significantly increased? Are there any plans on how these will evolve with new releases?
The memory requirement is definitely something that we're watching closely. We have made a significant effort to optimize memory use and reduce it as much as possible, which was largely successful and reached a minimum around the 2020.07 release. The recent increases in memory are due to the addition of many new ML models in 2020.10 and 2021.01 and are difficult to avoid. For example, the new fingerprint-based condition recommender consists of 4 models totaling over 9 GB. We did consider loading the models on demand, but that would not affect the peak memory use and would only introduce a delay when making the first request. There are probably still minor optimizations to be made, but they will be small in comparison to the size of these models.
Why am I getting TLS handshake errors when trying to download release 2021.01?
Some members were having issues pulling the Docker images from GitLab. This may be a country-related issue; if so, please reach out to the MIT team for assistance. Please remember that release 2021.01 is much larger than previous releases and will need more time to download.
Why am I getting timeout errors from Docker Compose?
If you are experiencing timeout issues, please try increasing COMPOSE_HTTP_TIMEOUT from the default of 60 to 120 or higher, either on the command line (e.g., COMPOSE_HTTP_TIMEOUT=120 docker-compose up -d) or in the .env file.
After cloning Release 2021.01 and running the deploy script, the following error appears: ‘ImportError: cannot import name UnrewindableBodyError’. Why is this?
This may indicate some version incompatibilities. Please confirm that you are using the latest version of docker-compose and a compatible version of Python (Docker Compose 1.28 requires Python 3.6 or later).
What are Deploy Tokens and how do I get them?
Deploy tokens allow read-only (pull) access to the git and container registries for MLPDS projects. These are useful when deploying applications. The tokens are available on the MLPDS Members resource area. Alternatively, you can reach out to the MIT team on Microsoft Teams or via email.
Why are there two sets of tokens?
There are two sets of deployment tokens used, these are:
- Group level tokens used to pull the ASKCOS docker images (non-developer install)
- Project-level tokens used to pull the askcos-data repository (required for a developer install). This project-level token was generated to address a bug in GitLab where cloning a repository with group-level deploy tokens fails on LFS objects; this does NOT occur when using a project-level deploy token (GitLab bug 235398). The project-level token is only needed if the askcos-data repository needs to be cloned; it will not work on any other repository.
How do I get access to development code to test specific, newly added functionality?
Development code is available on our GitLab instance. You can follow merge requests to see which features have been added or are being worked on. Branches that begin with ‘release-’ (e.g., release-0.4.0) are pre-release candidates that already incorporate new features we plan to include in that release. These are minimally tested, so you may still encounter bugs on release- branches. You can also test new functionality on the demo server (askcos-demo.mit.edu) without having to download the source code. When a new feature is merged, it is automatically pushed to the demo server and may be available to test many weeks before the feature is available in the released code.
Can ASKCOS be scaled up to allow more concurrent users?
The way to scale to more users is to increase the size of the worker pool. This is done by increasing the values of
- tb_c_workers (default: 12)
- tb_coordinator_mcts (default: 2)
Both can be found in the deploy file. Do not increase the nproc value; doing so will increase the number of workers that each task tries to use, which may end up making things slower overall.
If two users are making concurrent requests, are the workers distributed evenly between the two users?
If two users make a request at the same time, they will approximately share the computation from the 10 workers. It won’t be an exact split into two. The exact behavior will be that 16 (8×2) requests will be queued at all times, and the 10 workers will perform those expansions in a first-in-first-out manner.
Is it possible to limit the number of users submitting jobs?
Currently, that is not possible.
How can I customize our local ASKCOS deployment to help it stand out from the public version?
Currently you can add a subheader next to ASKCOS in the top left by changing the ORGANIZATION environment variable in the ‘customization’ file within the deploy folder. For the change to take effect, you will need to restart the app container. For a Docker Compose deployment, this can be done using docker-compose up -d --force-recreate app.