AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure–Property Associations

Mingxu Zhang, Dazhong Shen, Ying Sun

TL;DR

AtomDisc introduces an atom-level discrete tokenizer that maps local chemical environments to structure-aware tokens, which are embedded into LLMs to preserve fine-grained topology during language processing. By coupling a GNN-based encoder with vector-quantized tokens aligned to the LLM embedding space, AtomDisc injects a mechanistic chemical bias that improves property prediction and molecular generation, achieving state-of-the-art results on MoleculeNet benchmarks and challenging generative tasks. The approach also yields interpretable token–property relationships, with mixture tokens capturing context-dependent environments and enabling structure-centric instruction routing during generation. These findings point toward more capable and interpretable molecular LLMs that support mechanistic reasoning in drug design, catalysis, and materials science.

Abstract

Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity, is still lacking. To address this gap, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure–property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
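To make the quantization step concrete, the following is a minimal sketch of generic nearest-codeword vector quantization, not the authors' implementation. It assumes per-atom embeddings are already available (in AtomDisc they would come from the GNN-based encoder); the function name `quantize`, the toy codebook, and the `<atom_k>` token strings are all hypothetical.

```python
import math


def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codeword (Euclidean)."""
    tokens = []
    for vec in embeddings:
        distances = [math.dist(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical 2-D embeddings for three atoms and a 2-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0)]
atom_embeddings = [(0.1, -0.2), (0.9, 1.2), (0.4, 0.3)]

atom_tokens = quantize(atom_embeddings, codebook)

# Hypothetical token strings that could be interleaved with language tokens.
token_strings = [f"<atom_{index}>" for index in atom_tokens]
print(atom_tokens, token_strings)
```

In this toy setting, two atoms of the same element would receive different tokens whenever their environment-dependent embeddings fall closer to different codewords, which is the behavior the abstract describes at a high level.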

Paper Structure

This paper contains 109 sections, 17 equations, 33 figures, 33 tables.

Figures (33)

  • Figure 1: Overview of our approach. (a) A molecular graph is first processed by the AtomDisc Tokenizer, which assigns context-specific codes to atoms, converting each into a token that reflects its unique chemical environment. These structure tokens are concatenated with language tokens and input to the LLM. (b) AtomDisc-LLM workflow: molecular graphs are embedded, discretized into atom-level tokens, and combined with language tokens for unified modeling. (c) Example downstream tasks, including reagent prediction, property prediction, forward reaction prediction, and retrosynthesis.
  • Figure 2: Atom-level tokenization analysis and integration with LLMs. (a) 2D t-SNE visualization of atom-level representations before quantization; points within the same circle form a token cluster. (b) Token entropy distribution: KDE plots of token density (left) and frequency (right). (c) Mixture-token composition: Sankey diagram in which nodes are functional groups and edges represent shared mixture tokens (edge thickness indicates frequency). (d) Functional-group token allocation: number of unique tokens per group; color encodes Average Token Utilization (ATU, $\text{ATU} = \tfrac{\text{Total Usage}}{\text{Unique Tokens}}$), with darker shades for higher ATU. (e) Pure token ratio for each functional group (fraction of tokens exclusive to that group); color intensity reflects the number of associated mixture tokens.
  • Figure 3: Visualizations for property prediction. (a) Ablation study: comparison of performance with vs. without structural tokens (ST). (b) Attention allocation: attention values toward positive/negative functional groups in models trained with vs. without ST. (c) Representation analysis: for each functional group and its $r{=}2$ neighborhood, motif embeddings (mean-pooled from SMILES or AtomDisc tokens) are projected to $\mathbb{R}^2$ via t-SNE and colored by effect class (red: negative, yellow: positive, blue: neutral). Cluster compactness is quantified by the Davies–Bouldin (DB) Index. The ring plot shows the Gaussian KDE over angles $\theta=\operatorname{atan2}(y,x)$ after normalizing the t-SNE coordinates $(x,y)$ to the unit circle; brighter regions denote higher density.
  • Figure 4: Analysis of molecular generation tasks on Mol-Instructions datasets. (a) Quantitative evaluation: benchmark performance with baseline and ablation (with vs. without structural tokens, ST) comparisons; radial plots show multiple metrics, bar charts show dataset-level scores. (b) Case studies: representative prediction examples from models with and without ST, highlighting top-3 attended atoms (red = with ST, blue = without ST). (c) Attention dynamics: task-specific attention concentration and layer-wise activation; metrics include max weight, top-5 cumulative weight, standard deviation, and Gini coefficient. (d–h) Dataset-level statistics: aggregate forward-prediction analysis covering (d) attention entropy distribution, (e) attention concentration, (f) regional attention, (g) cross-region bidirectional attention (color = weight intensity), and (h) layer-activation comparison across tasks. Mean values and standard-deviation bands are shown.
  • Figure S1: The vast majority of codewords are actively utilized, indicating that the VQ discretizer successfully captures a diverse range of atomic environments. This high code diversity demonstrates that our VQ tokenizer can represent many distinct structural and chemical contexts, rather than collapsing onto a few trivial patterns. Such diverse code usage reflects the model’s capacity to distinguish atoms with different chemical properties, even among atoms of the same element type.
  • ...and 28 more figures
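
The two token statistics defined in the Figure 2 caption, Average Token Utilization (total usage divided by unique tokens) and the pure token ratio (fraction of a group's tokens exclusive to that group), can be sketched as follows. This is an illustrative reading of those caption definitions under stated assumptions, not the authors' analysis code: the function name `token_stats`, the `(functional_group, token_id)` input format, and the toy assignments are hypothetical.

```python
from collections import defaultdict


def token_stats(assignments):
    """Compute ATU and pure token ratio per functional group.

    assignments: iterable of (functional_group, token_id) pairs,
    one pair per atom occurrence (hypothetical input format).
    """
    usage = defaultdict(int)      # group -> total number of token usages
    tokens = defaultdict(set)     # group -> unique token ids seen in the group
    groups_of = defaultdict(set)  # token id -> groups in which it appears
    for group, token in assignments:
        usage[group] += 1
        tokens[group].add(token)
        groups_of[token].add(group)

    stats = {}
    for group, unique in tokens.items():
        # A token is "pure" for a group if no other group uses it.
        pure = [t for t in unique if len(groups_of[t]) == 1]
        stats[group] = {
            "atu": usage[group] / len(unique),
            "pure_ratio": len(pure) / len(unique),
        }
    return stats


# Toy data: token 7 is shared by both groups, so it acts as a mixture token.
toy = [
    ("hydroxyl", 3), ("hydroxyl", 3), ("hydroxyl", 7),
    ("amine", 7), ("amine", 9), ("amine", 9), ("amine", 9), ("amine", 12),
]
stats = token_stats(toy)
print(stats)
```

In the toy data, the hydroxyl group uses two unique tokens three times in total (ATU 1.5), and one of its two tokens is shared with the amine group, giving a pure token ratio of 0.5.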