Circuits, Features, and Heuristics in Molecular Transformers

Kristof Varadi, Mark Marosi, Peter Antal

TL;DR

The paper probes how autoregressive molecular transformers learn to enforce chemical grammar and validity in SMILES generation. It combines mechanistic analyses of attention circuits (ring and branch) with valence-budgeting signals in the residual stream, and uses sparse autoencoders to extract interpretable feature dictionaries tied to functional groups. The results show specialized circuits and disentangled latent detectors that improve downstream property prediction (e.g., MoleculeACE, cADME) and enable controllable generation through latent steering. Overall, the work provides a concrete, testable framework for mechanistic interpretability in molecular language models and highlights opportunities to guide design and optimization with interpretable internal signals.

Abstract

Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.
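
As a rough illustration of what extracting a feature dictionary with a sparse autoencoder involves, the sketch below runs the forward pass and loss of a generic ReLU SAE with an L1 sparsity penalty. This is the common textbook formulation, not necessarily the architecture used in the paper; the function names, weights, and the `l1_coeff` value are hypothetical toy choices.

```python
# Minimal sketch of a generic sparse autoencoder (SAE) forward pass.
# Assumption: standard ReLU encoder plus L1 penalty; the paper's exact
# SAE variant and hyperparameters are not given here.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Return (sparse feature activations, reconstruction) for activation x."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [z + b for z, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, z) for z in pre_acts]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff):
    """Squared reconstruction error plus L1 sparsity penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Toy example: a 2-dimensional activation and a 4-feature dictionary.
# All values are hypothetical and chosen for readability.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=0.1)
print(features, x_hat, loss)
```

In this toy dictionary only two of the four features fire for the given input, which is the kind of sparsity that makes individual features candidates for interpretation.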

Paper Structure

This paper contains 67 sections, 5 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Syntax Circuits in Molecular Transformers. We find attention heads involved in pairing ring digits and balancing branch parentheses in SMILES. (Left) Heatmaps showing pointer mass (a-b) and the causal impact of ablation ($\Delta$-margin, c-d) for ring and branch syntax across all layers and heads. White boxes highlight the top two heads by pointer mass. (Right) Performance of the top specialized heads for rings (e) and parentheses (f) as a function of increasing syntactic difficulty, showing both pointer mass and $\Delta$-margin. The L2H7 head consistently identifies correct ring openers and activates more intensely at larger distances. L2H3 points at opening parentheses at branch closures.
  • Figure 2: Valence Capacity is Linearly Decodable. (Left) Layer-wise localization. Blue line shows linear probe accuracy for predicting remaining valence. Dashed red line shows the aggregate causal effect of the valence vector on bond logits. Both metrics peak at Layer 3, indicating this layer contains the most information about valence. (Right) Layer 3 Steering. We intervene on the residual stream ($x \leftarrow x + \alpha \hat{w}$) at decision tokens. Increasing steering intensity shifts probability mass from single bonds (-) to higher-order bonds (=, #). Shaded bands indicate 95% bootstrap confidence intervals ($N=1000$).
  • Figure 3: Feature Robustness is Layer-dependent. (Left) Distribution of Jaccard Similarity measuring feature set identity across layers for different SAE dictionary sizes. (Right) Distribution of cosine similarity scores measuring fluctuations in activation magnitude. Early layers (L0–L1) show high syntax invariance, while deeper layers show increased variance consistent with autoregressive path dependency, where activation magnitudes may fluctuate, but a significant share of features remains active. As dictionary size grows, SAEs exhibit increasingly permutation-dependent behavior.
  • Figure 4: Sparse Dictionaries Contain Fragment Features. Plots show kernel density estimates of fragment-based specificity for common substructures in Layer 3 Dense MLP (blue) and sparse SAE representations (red). The distributions highlight two complementary effects of sparsification: (i) a large mass of near-zero specificity values indicating that most SAE features remain silent or weakly responsive outside of their preferred contexts, and (ii) a distinct high-specificity tail showing that a small number of SAE neurons act as strong detectors for chemically meaningful substructures. In contrast, dense residual representations exhibit smoother, more entangled activation patterns with no comparably sharp fragment-aligned features.
  • Figure 5: Representative Chemical Features. (Left) SAE feature with selective activation on the second nitrogen of urea groups (N–C(=O)–N). (Right) SAE feature that activates on fluorine atoms attached to aromatic carbons, with limited response to aliphatic fluorine, consistent with an aryl-halide detector.
  • ...and 6 more figures
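
The steering intervention in the Figure 2 caption, $x \leftarrow x + \alpha \hat{w}$, can be sketched on a toy example. The code below adds a scaled unit direction to a residual vector and reads out a softmax over the three bond tokens. Only the update rule comes from the caption; the 3-dimensional residual, the direction `w_hat`, and the `unembed` rows are hypothetical values picked so that the direction favors higher-order bonds.

```python
import math

def normalize(v):
    """Scale v to unit length so alpha alone controls steering strength."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def steer(x, w_hat, alpha):
    """Apply the intervention x <- x + alpha * w_hat."""
    return [xi + alpha * wi for xi, wi in zip(x, w_hat)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bond_probs(x, unembed):
    """Project a residual vector onto bond-token logits and normalize."""
    logits = [sum(u * xi for u, xi in zip(row, x)) for row in unembed.values()]
    return dict(zip(unembed.keys(), softmax(logits)))

# Hypothetical toy setup: not taken from the paper's model.
w_hat = normalize([1.0, 1.0, 0.0])   # stand-in for the probed valence direction
unembed = {                          # stand-in unembedding rows for bond tokens
    "-": [-1.0, 0.0, 0.5],
    "=": [0.5, 0.5, 0.0],
    "#": [0.5, 1.0, -0.5],
}
x = [0.2, -0.1, 0.3]                 # stand-in residual state at a decision token

results = {}
for alpha in (0.0, 1.0, 2.0):
    results[alpha] = bond_probs(steer(x, w_hat, alpha), unembed)
    print(alpha, {tok: round(p, 3) for tok, p in results[alpha].items()})
```

With these toy weights, larger `alpha` lowers the single-bond probability because the direction has a negative projection onto the `-` row and positive projections onto the `=` and `#` rows. In a real model the direction would come from the trained probe and the readout from the model's own later layers and unembedding.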