
Universal Data Structure (UDS)

Universal Data Structure (UDS) provides a unifying framework for representing the tree and graph network data that is displayed and modified in the frontend and backend. It unifies the data representation used for communication between the two ends, so that data passes seamlessly between services.

Data Structure Trinity

Three major data structures collectively represent the metadata and connectivity of synthesis pathways in ASKCOS: Graph, Unified Tree, and Pathways. A Tree Builder (TB) job typically works on either a Graph (MCTS) or a Unified Tree (Retro*) data structure when searching for synthesis pathways.

Further post-processing of the Tree Builder job yields a complete acyclic graph that captures the entire searched space, and enumerating this search graph produces individual pathways. The result of a Tree Builder job can be viewed in two ways: through the Pathways data structure in Tree Explorer (TE), or through the Unified Tree data structure in the Interactive Path Planner (IPP). In contrast, an expand-one call from the frontend IPP page interacts with the backend and modifies the IPP on the fly.

Note that any modification made on the canvas of IPP or Tree Explorer inherently modifies the frontend dispGraph and dataGraph.

Graph

  • Directed Acyclic Graph
  • Node ID: SMILES
  • Used in TB graph output, IPP/TE dataGraph

Unified Tree

  • Directed Rooted Trees
  • Node ID: UUID
  • Used in IPP/TE dispGraph

Pathways

  • Multiple Directed Rooted Trees
  • Node ID: UUID
  • Used in TB path output
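The difference between SMILES node IDs and UUID node IDs can be illustrated with a small sketch. In a Graph, a chemical reachable through several reactions is a single node; in a tree, each occurrence needs its own ID. The function name `unroll_to_tree`, the adjacency-dict input, and the placeholder node labels below are assumptions made for illustration only and are not taken from the ASKCOS code base.

```python
import uuid


def unroll_to_tree(dag, root):
    """Unroll a SMILES-keyed DAG into a UUID-keyed rooted tree.

    dag maps each node ID to a list of child node IDs (hypothetical format).
    Returns (tree_edges, uuid2smiles, root_uuid).
    """
    uuid2smiles = {}
    tree_edges = []

    def visit(smiles):
        node_id = str(uuid.uuid4())  # fresh ID for every occurrence
        uuid2smiles[node_id] = smiles
        for child in dag.get(smiles, []):
            child_id = visit(child)
            tree_edges.append({"source": node_id, "target": child_id})
        return node_id

    root_id = visit(root)
    return tree_edges, uuid2smiles, root_id


# Placeholder labels stand in for real SMILES strings.
# "chem_C" is a precursor of two different reactions.
dag = {
    "target": ["rxn_1", "rxn_2"],
    "rxn_1": ["chem_A", "chem_C"],
    "rxn_2": ["chem_B", "chem_C"],
}
tree_edges, uuid2smiles, root_id = unroll_to_tree(dag, "target")
```

The DAG has six nodes, while the unrolled tree has seven, because `chem_C` appears once under each reaction and receives two distinct UUIDs.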


UDS representation format

To unify the three data structures, UDS is designed to capture the nuances of each representation while eliminating redundancy and making edits easier through a flatter representation.

{
    "uds": {
        "node_dict": {
            "CHEM_SMILES_1": {
                CHEM_SMILES_1_METADATA
            },
            "CHEM_SMILES_2": {
                CHEM_SMILES_2_METADATA
            },
            "RXN_SMILES_1": {
                RXN_SMILES_1_METADATA
            },
            "RXN_SMILES_2": {
                RXN_SMILES_2_METADATA
            },
            ...
        },
        "graph": [
            {
                "source": "CHEM_SMILES_1",
                "target": "RXN_SMILES_1"
            },
            {
                "source": "CHEM_SMILES_2",
                "target": "RXN_SMILES_2"
            },
            ...
        ],
        "uuid2smiles": {
            "UUID_1": "CHEM_SMILES_1",
            "UUID_2": "CHEM_SMILES_2",
            "UUID_3": "RXN_SMILES_1",
            "UUID_4": "RXN_SMILES_2",
            ...
        },
        "pathways": [
            [
                {
                    "source": "UUID_1",
                    "target": "UUID_2"
                },
                {
                    "source": "UUID_3",
                    "target": "UUID_4"
                },
                ...
            ],
            [
                {
                    "source": "UUID_5",
                    "target": "UUID_6"
                },
                {
                    "source": "UUID_7",
                    "target": "UUID_8"
                },
                ...
            ],
            ...
        ],
        "pathways_properties": [
            {
                PATH_PROP_1
            },
            {
                PATH_PROP_2
            },
            ...
        ]
    }
}

The representation format above consists of five sections: node_dict, graph, uuid2smiles, pathways, and pathways_properties. It separates node metadata from connectivity to remove redundancy, which makes the information easier to modify. The connectivity of the graph and of the pathways is stored in node-link format.
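As a minimal sketch of how the separation of metadata and connectivity can be used, the snippet below builds an adjacency map from the node-link edges and checks that every edge endpoint has an entry in node_dict. The helper name `build_adjacency` is hypothetical, the sample object holds only the contents of the inner "uds" key, and the edge direction (product chemical to reaction to precursor chemicals) is an assumption.

```python
from collections import defaultdict

PRODUCT = "CN(C)CCOC(c1ccccc1)c1ccccc1"
RXN = "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1"

# Trimmed-down UDS content; real node metadata has many more fields.
uds = {
    "node_dict": {
        PRODUCT: {"type": "chemical", "id": PRODUCT},
        RXN: {"type": "reaction", "id": RXN},
        "CN(C)CCCl": {"type": "chemical", "id": "CN(C)CCCl"},
        "OC(c1ccccc1)c1ccccc1": {"type": "chemical", "id": "OC(c1ccccc1)c1ccccc1"},
    },
    "graph": [
        {"source": PRODUCT, "target": RXN},
        {"source": RXN, "target": "CN(C)CCCl"},
        {"source": RXN, "target": "OC(c1ccccc1)c1ccccc1"},
    ],
}


def build_adjacency(uds):
    """Map each source node ID to the list of its target node IDs."""
    node_dict = uds["node_dict"]
    adjacency = defaultdict(list)
    for edge in uds["graph"]:
        # Connectivity only stores IDs, so metadata must exist in node_dict.
        for endpoint in (edge["source"], edge["target"]):
            if endpoint not in node_dict:
                raise KeyError(f"edge endpoint missing from node_dict: {endpoint}")
        adjacency[edge["source"]].append(edge["target"])
    return dict(adjacency)


adjacency = build_adjacency(uds)
```

Because metadata lives only in node_dict, editing a node's properties touches one entry regardless of how many edges reference it.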

node_dict - Stores the metadata of every chemical node and reaction node returned by the expand-one call. Type: Dict of Dict.

graph - Stores the search graph connectivity in node-link format. Type: List of Dict.

uuid2smiles - Maps each UUID to its SMILES, which allows a Graph object to be reconstructed from the UDS. Type: Dict.

pathways - Stores the connectivity of each pathway in node-link format. Type: List of List of Dict.

pathways_properties - Stores the pathway properties from tree analysis jobs. Type: List of Dict.
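The role of uuid2smiles can be sketched as follows: pathway edges refer to UUIDs, and looking each UUID up yields SMILES-keyed edges in which repeated occurrences collapse into one. The function name `pathways_to_graph` and the UUID strings are hypothetical, and this is only one plausible way to perform the reconstruction, not necessarily how ASKCOS does it.

```python
PRODUCT = "CN(C)CCOC(c1ccccc1)c1ccccc1"
RXN = "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1"

# Placeholder UUIDs; real ones would be generated values.
uuid2smiles = {
    "uuid-1": PRODUCT,
    "uuid-2": RXN,
    "uuid-3": "CN(C)CCCl",
    "uuid-4": "OC(c1ccccc1)c1ccccc1",
}
pathways = [
    [
        {"source": "uuid-1", "target": "uuid-2"},
        {"source": "uuid-2", "target": "uuid-3"},
        {"source": "uuid-2", "target": "uuid-4"},
    ],
]


def pathways_to_graph(pathways, uuid2smiles):
    """Translate UUID edges to SMILES edges, dropping duplicate edges."""
    seen = set()
    graph = []
    for pathway in pathways:
        for edge in pathway:
            pair = (uuid2smiles[edge["source"]], uuid2smiles[edge["target"]])
            if pair not in seen:
                seen.add(pair)
                graph.append({"source": pair[0], "target": pair[1]})
    return graph


graph = pathways_to_graph(pathways, uuid2smiles)
```

The returned list has the same node-link shape as the graph section, with SMILES instead of UUIDs as node IDs.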

Chemical Node Dictionary

{
    "smiles": "CN(C)CCOC(c1ccccc1)c1ccccc1",
    "as_reactant": 59, 
    "as_product": 44,
    "properties": [
        properties_list
    ],
    "purchase_price": 4.13,
    "terminal": false,
    "type": "chemical",
    "id": "CN(C)CCOC(c1ccccc1)c1ccccc1"
}
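For consumers that want a light structural check, the fields of the example above can be mirrored in a typed dictionary. This is a sketch only: the class name `ChemicalNode`, the helper `is_chemical_node`, and the rule that `id` equals `smiles` are assumptions inferred from the example, not a documented schema.

```python
from typing import Any, List, TypedDict


class ChemicalNode(TypedDict):
    """Field names and types inferred from the example entry (assumption)."""
    smiles: str
    as_reactant: int
    as_product: int
    properties: List[Any]
    purchase_price: float
    terminal: bool
    type: str
    id: str


def is_chemical_node(node: dict) -> bool:
    """Check that all example fields are present and the node is a chemical."""
    required = ChemicalNode.__annotations__
    return (
        all(key in node for key in required)
        and node["type"] == "chemical"
        and node["id"] == node["smiles"]
    )


example = {
    "smiles": "CN(C)CCOC(c1ccccc1)c1ccccc1",
    "as_reactant": 59,
    "as_product": 44,
    "properties": [],
    "purchase_price": 4.13,
    "terminal": False,
    "type": "chemical",
    "id": "CN(C)CCOC(c1ccccc1)c1ccccc1",
}
ok = is_chemical_node(example)
```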

Reaction Node Dictionary

{
    "smiles": "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1",
    "precursor_rank": 1, // rank from reranker
    "precursor_score": -0.003586091330295326, // score from reranker
    "plausibility": 0.9981883764266968, 
    "rxn_score_from_model": 0.334626168012619, // average of normalized_model_score
    "model_metadata": [
        {
            "direction": "retro",
            "backend": "template_relevance",
            "model_name": "reaxys",
            "attributes": {
                "max_num_templates": 1000,
                "max_cum_prob": 0.995,
                "attribute_filter": []
            },
            "model_score": 0.334626168012619,
            "normalized_model_score": 0.334626168012619,
            "rank": 1,
            "reaction_id": null,
            "reaction_set": null,
            "source": {
                "template": { ... }, // template relevance
                "reaction_data": {} // retrosim
            }
        },
        {
            "direction": "retro",
            "backend": "augmented_transformer",
            "model_name": "USPTO_FULL",
            "attributes": {},
            "model_score": 0.13734960132034285,
            "normalized_model_score": 0.17782974596553017,
            "rank": 2,
            "reaction_id": null,
            "reaction_set": null,
            "source": {
                "template": null,
                "reaction_data": null
            }
        }
    ],
    "precursor_properties": {
      "rms_molwt": 150.57960055284425,
      "num_rings": 2,
      "scscore": 1.51275690065044
    },
    "reaction_properties": {
      "canonical_reaction_smiles": "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1",
      "mapped_smiles": "Cl[CH2:5][CH2:4][N:2]([CH3:1])[CH3:3].[OH:6][CH:7]([c:8]1[cH:9][cH:10][cH:11][cH:12][cH:13]1)[c:14]1[cH:15][cH:16][cH:17][cH:18][cH:19]1>>[CH3:1][N:2]([CH3:3])[CH2:4][CH2:5][O:6][CH:7]([c:8]1[cH:9][cH:10][cH:11][cH:12][cH:13]1)[c:14]1[cH:15][cH:16][cH:17][cH:18][cH:19]1",
      "plausibility": 0.9981883764266968,
      "reacting_atoms": [
        5,
        6
      ],
      "selec_error": null,
      "cluster_id": null,
      "cluster_name": null
    },
    "type": "reaction",
    "id": "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1"
}
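Two common operations on a reaction node can be sketched briefly: splitting the reaction SMILES into precursors and products, and picking the model entry with the highest normalized score. The helper names `split_reaction_smiles` and `best_model` are hypothetical, and splitting components on "." is a simplification that suits the example above.

```python
def split_reaction_smiles(rxn_smiles):
    """Split 'reactants>agents>products' into reactant and product lists."""
    reactants, _agents, products = rxn_smiles.split(">")
    return reactants.split("."), products.split(".")


def best_model(model_metadata):
    """Return the model entry with the highest normalized_model_score."""
    return max(model_metadata, key=lambda m: m["normalized_model_score"])


# Trimmed-down copy of the example reaction node.
reaction_node = {
    "smiles": "CN(C)CCCl.OC(c1ccccc1)c1ccccc1>>CN(C)CCOC(c1ccccc1)c1ccccc1",
    "model_metadata": [
        {"backend": "template_relevance", "normalized_model_score": 0.334626168012619},
        {"backend": "augmented_transformer", "normalized_model_score": 0.17782974596553017},
    ],
    "type": "reaction",
}

precursors, products = split_reaction_smiles(reaction_node["smiles"])
top = best_model(reaction_node["model_metadata"])
```

The precursors and products obtained this way are the SMILES strings that serve as chemical node IDs in node_dict.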

Released under the MIT License.