Representing Molecules as Random Walks Over Interpretable Grammars
Michael Sun, Minghao Guo, Weize Yuan, Veronika Thost, Crystal Elaine Owens, Aristotle Franklin Grosz, Sharvaa Selvan, Katelyn Zhou, Hassan Mohiuddin, Benjamin J Pedretti, Zachary P Smith, Jie Chen, Wojciech Matusik
TL;DR
This work addresses the challenge of discovering complex, modular molecules in data-scarce material design settings. It introduces a motif-based graph grammar over a motif graph $G$ and represents molecules as random walks on a derivation $H_M$ of a context-sensitive grammar. The method combines motif-based fragmentation, grammar parameters learned through a graph-diffusion process, and end-to-end downstream prediction with a GNN on $\hat{H}_M$, supporting both accurate property prediction and diverse, synthesizable molecule generation. Key contributions include an explicit, interpretable grammar over motifs, efficient learning with a set-based memory, and empirical gains on GC, HOPV, and PTC across prediction and generation, together with rule extraction and visualization analyses that illuminate the chemical design principles the model captures. In practice, the approach supports data-efficient discovery workflows in collaboration with domain experts, offers interpretability that aids hypothesis generation and experimental validation, and maintains high synthesis feasibility for generated designs. Concretely, the framework models the design space by graph diffusion over a learnable motif graph, $\frac{dx_t}{dt} = L(\Phi, t) x_t$, where $x_t \in \mathbb{R}^{|V|}$, $L(\Phi, t) = D - \hat{W}(t)$, and $\hat{W}(t) = W + h(c_t; \phi)$, with a set-based memory updated as $c^{(t+1)} = \frac{t}{t+1} c^{(t)} + \frac{1}{t+1} p^{(t)}$.
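To make the diffusion and memory equations above concrete, here is a minimal NumPy sketch on a toy three-node motif graph. It is an illustration under stated assumptions, not the authors' implementation: the weights `W`, the function `h` (a stand-in for $h(c_t; \phi)$), the choice of $p^{(t)}$ as a one-hot indicator of the visited motif, the use of row sums of $\hat{W}$ for $D$, and the explicit Euler step are all hypothetical choices made for illustration.

```python
import numpy as np

def update_memory(c, p, t):
    """Set-based memory: c^(t+1) = t/(t+1) * c^(t) + 1/(t+1) * p^(t)."""
    return (t / (t + 1)) * c + (1.0 / (t + 1)) * p

def laplacian(W, c, h):
    """L = D - W_hat with W_hat = W + h(c).
    Assumption: D is the diagonal matrix of row sums of W_hat."""
    W_hat = W + h(c)
    D = np.diag(W_hat.sum(axis=1))
    return D - W_hat

def euler_step(x, L, dt):
    """One explicit Euler step of dx/dt = L x, with the sign as written in the text."""
    return x + dt * (L @ x)

# Toy motif graph with 3 nodes (hypothetical edge weights).
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 2.0],
              [0.5, 2.0, 0.0]])

def h(c, phi=0.1):
    """Hypothetical stand-in for h(c; phi): reduce the weight of edges
    leading into motifs that the memory c marks as already visited."""
    return -phi * np.tile(c, (len(c), 1)) * (W > 0)

visits = [0, 2, 2, 1]      # hypothetical sequence of visited motifs
c = np.zeros(3)            # initial memory (its value is irrelevant at t = 0)
for t, v in enumerate(visits):
    p = np.eye(3)[v]       # assumption: p^(t) is a one-hot vector for motif v
    c = update_memory(c, p, t)

L = laplacian(W, c, h)
x = np.eye(3)[visits[-1]]  # state concentrated on the last visited motif
x_next = euler_step(x, L, dt=0.01)
```

Unrolling the recursion shows that $c^{(T)}$ is the running mean of $p^{(0)}, \dots, p^{(T-1)}$, so the memory depends only on which motifs were visited and how often, not on their order.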
Abstract
Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples, which are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space, with motifs as the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability.
