Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders

Jaron Cohen, Alexander G. Hasson, Sara Tanovic

TL;DR

This work tackles the interpretability gap in chemistry foundation models by applying sparse autoencoders (SAEs) to the internal representations of a chemistry language model (CLM), SMI-TED. The authors formulate a sparse dictionary learning framework and show that SAE features disentangle latent chemical knowledge into substructures, physicochemical descriptors, and functional concepts. Causal steering of these features enables targeted molecular modifications while preserving model fidelity. The approach is validated through substructure detection, descriptor correlations, and toxicity prediction, highlighting both the interpretability gains and the potential for safer, more controllable chemical AI. The findings suggest that CLMs encode rich, navigable chemical knowledge that can be accessed and manipulated via sparse, interpretable feature dictionaries, informing both foundational understanding and the practical acceleration of computational chemistry research.

Abstract

Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment, with the potential to accelerate computational chemistry research.

Paper Structure

This paper contains 42 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of our workflow. Embeddings are extracted from SMI-TED and converted into features via the SAE model. These features are then interpreted to find relationships with structural and physical information.
  • Figure 2: Feature landscape calculated on a 10k subset of the MOSES dataset. Each point represents one SAE feature, mapped according to three metrics: (1) Activation frequency (x-axis) measures how many molecules activate this feature, revealing whether it detects common or rare chemical attributes; (2) Mean normalised activation (y-axis) quantifies the typical strength when the feature activates, indicating its importance when present; (3) Coefficient of variation (colour gradient) represents consistency of activation strength, with darker points showing more consistent behaviour.
  • Figure 3: Examples of molecules with altered substructures highlighted before (green) and after (red) steering. Steering is performed by setting the specified feature activation to zero.
  • Figure S1: Sparsity-Fidelity Trade-off Across SAE Configurations.
  • Figure S2: Feature activation frequency distributions across SAE hyperparameter configurations. Each subplot shows the histogram of activation frequencies for individual SAE features, organised by expansion factor (rows) and sparsity level (columns). Histograms use logarithmic binning and scaling to visualise the characteristic heavy-tailed distribution of feature activations.
  • ...and 7 more figures
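
The three per-feature metrics named in the Figure 2 caption (activation frequency, mean normalised activation, and coefficient of variation) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a feature counts as "active" when its activation is strictly positive, that activations are normalised by each feature's own maximum, and that the coefficient of variation is the population standard deviation divided by the mean over active molecules only. The function name `feature_landscape` and the toy activation matrix are hypothetical.

```python
from statistics import mean, pstdev


def feature_landscape(activations):
    """Compute per-feature summary metrics from SAE activations.

    activations: list of rows, one per molecule; each row holds one
    non-negative activation value per SAE feature.
    """
    n_molecules = len(activations)
    n_features = len(activations[0])
    metrics = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        # Assumption: a feature is "active" when its activation is > 0.
        active = [a for a in column if a > 0.0]
        if not active:
            # Dead feature: never fires on this dataset.
            metrics.append(
                {"frequency": 0.0, "mean_norm_activation": 0.0, "cv": 0.0}
            )
            continue
        # Assumption: normalise by the feature's own maximum activation.
        peak = max(active)
        normalised = [a / peak for a in active]
        mu = mean(normalised)
        metrics.append(
            {
                # Fraction of molecules that activate the feature.
                "frequency": len(active) / n_molecules,
                # Typical strength when the feature is active.
                "mean_norm_activation": mu,
                # Spread of strength relative to its mean (lower = more consistent).
                "cv": pstdev(normalised) / mu,
            }
        )
    return metrics


# Toy data: 4 molecules x 3 features (hypothetical values).
toy = [
    [0.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 2.0, 0.0],
    [0.0, 2.0, 0.0],
]
landscape = feature_landscape(toy)
```

In this toy example the second feature fires on every molecule with identical strength, so its coefficient of variation is zero, while the third feature never fires and is reported as dead.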