AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure–Property Associations
Mingxu Zhang, Dazhong Shen, Ying Sun
TL;DR
AtomDisc introduces an atom-level discrete tokenizer that maps local chemical environments to structure-aware tokens, which are embedded in the LLM's token space so that fine-grained topology is preserved during language processing. By coupling a GNN-based encoder with vector-quantized tokens aligned to the LLM embedding space, AtomDisc injects a chemically informed inductive bias that improves property prediction and molecular generation, achieving state-of-the-art results on MoleculeNet benchmarks and challenging generative tasks. The approach also yields interpretable token–property relationships, with mix tokens capturing context-dependent environments and enabling structure-centric instruction routing during generation. These findings point toward more powerful and interpretable molecular LLMs that support mechanistic reasoning in drug design, catalysis, and materials science.
Abstract
Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, effective fine-grained tokenization of local atomic environments, which are critical for determining complex chemical properties and reactivity, is still lacking. To address this gap, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. Our experiments show that AtomDisc can distinguish, in a data-driven way, chemically meaningful structural features that reveal structure–property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
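
To make the quantization step concrete, the following is a minimal sketch of nearest-neighbor vector quantization, not the AtomDisc implementation. It assumes that each atom already has an embedding of its local environment (in the paper this comes from a GNN-based encoder; here the embeddings are hand-written toy values) and that a codebook of token vectors is given. The function names, the `<atom_k>` token format, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_atoms(
    atom_embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each atom embedding the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(emb, codebook[k]))
        for emb in atom_embeddings
    ]


def to_token_strings(token_ids: Sequence[int], prefix: str = "<atom_") -> List[str]:
    """Render discrete ids as token strings (hypothetical naming scheme)."""
    return [f"{prefix}{i}>" for i in token_ids]


# Toy example: three atoms with 2-d embeddings and a three-entry codebook.
# Real embeddings would come from a trained encoder; these are assumptions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
atom_embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

token_ids = quantize_atoms(atom_embeddings, codebook)
tokens = to_token_strings(token_ids)
print(token_ids)  # [0, 1, 2]
print(tokens)     # ['<atom_0>', '<atom_1>', '<atom_2>']
```

In this sketch, atoms whose local environments are close in embedding space receive the same discrete token, which is the property that lets token identities be related to structural features. How the codebook is learned and how the tokens are aligned with the LLM embedding space are outside its scope.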
