Representing Molecules as Random Walks Over Interpretable Grammars

Michael Sun, Minghao Guo, Weize Yuan, Veronika Thost, Crystal Elaine Owens, Aristotle Franklin Grosz, Sharvaa Selvan, Katelyn Zhou, Hassan Mohiuddin, Benjamin J Pedretti, Zachary P Smith, Jie Chen, Wojciech Matusik

TL;DR

This work addresses the challenge of discovering complex, modular molecules in data-scarce material design settings. It introduces a motif-based graph grammar over a motif graph $G$ and represents molecules as random walks on a derivation $H_M$ of a context-sensitive grammar. The method integrates motif-based fragmentation, grammar parameters learned via a graph-diffusion process, and end-to-end downstream prediction using a GNN on $\hat{H}_M$, supporting both property prediction and the generation of diverse, synthesizable molecules. Key contributions include an explicit, interpretable grammar over motifs, efficient learning with a set-based memory, and empirical gains on GC, HOPV, and PTC across prediction and generation, along with rule extraction and visualization analyses that illuminate the chemical design principles captured by the model. The approach supports data-efficient discovery workflows in collaboration with domain experts, and its interpretability aids hypothesis generation and experimental validation while keeping generated designs feasible to synthesize. Specifically, the framework models the design space by graph diffusion over a learnable motif graph, $\frac{dx_t}{dt} = L(\Phi, t) x_t$, where $x_t \in \mathbb{R}^{|V|}$, $L(\Phi,t) = D - \hat{W}(t)$, and $\hat{W}(t) = W + h(c_t; \phi)$. The set-based memory is updated as $c^{(t+1)} = \frac{t}{t+1} c^{(t)} + \frac{1}{t+1} p^{(t)}$.
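The memory update above is a running average, which a small sketch can make concrete. The code below is illustrative only and is not the authors' implementation. It assumes that $t$ starts at 0 and that each $p^{(t)}$ is a one-hot vector marking the motif visited at step $t$; the summary does not define $p^{(t)}$, so both are assumptions, and the names `update_memory` and `run_memory` are hypothetical.

```python
import numpy as np

def update_memory(c, p, t):
    """One step of c^(t+1) = t/(t+1) * c^(t) + 1/(t+1) * p^(t)."""
    return (t / (t + 1)) * c + (1.0 / (t + 1)) * p

def run_memory(states):
    """Fold a sequence of state vectors p^(0), p^(1), ... into the memory.

    Assumption: t starts at 0, so the initial memory is multiplied by zero
    and its value does not matter.
    """
    c = np.zeros_like(states[0], dtype=float)
    for t, p in enumerate(states):
        c = update_memory(c, p, t)
    return c

# Toy walk over a 4-node motif graph visiting nodes 2 -> 0 -> 2.
# Assumption: each p^(t) is the one-hot indicator of the visited node.
num_nodes = 4
walk = [2, 0, 2]
states = [np.eye(num_nodes)[v] for v in walk]
memory = run_memory(states)

# The recursion reproduces the plain mean of the visited states, so the
# memory depends on which states were visited, not on their order.
reordered = run_memory([states[1], states[2], states[0]])
```

Under these assumptions `memory` equals the mean of the visited one-hot vectors, and reordering the walk leaves it unchanged, which is consistent with calling the memory set-based.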

Abstract

Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space featuring motifs to be the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability.

Paper Structure (51 sections, 5 equations, 23 figures, 9 tables)


Figures (23)

  • Figure 1: Illustration of our random walk representation: (a) (top) molecule $M$, number 33; (middle) $H_M$ as a connected subgraph of $G$; (bottom) $\hat{H}_M$ as a random walk over $H_M$; (b) the motif graph $G$, where each node is a motif $v$ that contains both the molecular fragment $v_B$ (black molecule sections) and the contexts for attachment ($v_R$, red molecule sections), and each gray line indicates a possible attachment between nodes. Directed edges of $\hat{H}_M$ use the same color as the dashed border of the corresponding figure of $M$; (c) (top) demonstration of motif matching criteria Eqs. \ref{eq:criteria-1}-\ref{eq:criteria-4} ($183\leftrightarrow 5$), with another example in Fig. \ref{match-example}; (bottom) two more examples of $H_M$.
  • Figure 2: Illustration of our generation procedure: (t=1) our learnable grammar parameterized by $\Phi$ samples a state transition $56\rightarrow 9$; (t=2) with the memory of having visited $\{56\}$, our grammar samples a state transition $\rightarrow 71$; (t=10) (bottom) our grammar samples a final transition $5$, which determines the molecular structure (top); our program's notation is $56\rightarrow 9\rightarrow 71 [\rightarrow 70\rightarrow 5]\rightarrow 70:1\rightarrow 5:1$
  • Figure 3: Example molecules from GC, HOPV, and PTC. These datasets are characterized by modular substructures that correspond to meaningful chemical functional groups.
  • Figure 4: Visualization of our motif graph $G$; black edges indicate matched motif pairs, and the thickness of each red edge corresponds to the number of $H_M$ that traverse it.
  • Figure 5: Varying the training dataset size from 10% to 70%.
  • ...and 18 more figures