Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders
Jaron Cohen, Alexander G. Hasson, Sara Tanovic
TL;DR
This work tackles the interpretability gap in chemistry foundation models by applying sparse autoencoders (SAEs) to the internal representations of a chemistry language model (CLM), SMI-TED. The authors formulate a sparse dictionary learning framework and demonstrate that SAE features disentangle latent chemistry knowledge into substructures, physicochemical descriptors, and functional concepts, with causal steering enabling targeted molecular modifications while preserving model fidelity. They validate the approach through substructure detection, descriptor correlations, and toxicity prediction, highlighting both the interpretability gains and the potential for safer, more controllable chemical AI. The findings suggest that CLMs encode rich, navigable chemical knowledge that can be accessed and manipulated via sparse, interpretable feature dictionaries, informing both foundational understanding and practical acceleration of computational chemistry research.
Abstract
Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment, with the potential to accelerate computational chemistry research.
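
For readers unfamiliar with sparse autoencoders, the general idea referred to above can be illustrated with a minimal sketch. This is not the architecture or training setup used in this work: the ReLU encoder, the L1 sparsity penalty, the layer sizes, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions chosen for illustration, and random vectors stand in for real CLM activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    # ReLU clips negative pre-activations to zero, so inactive features are exactly 0.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Each row of W_dec acts as one dictionary direction in activation space.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes; random data stands in for CLM hidden activations.
rng = np.random.default_rng(0)
d_model, d_dict, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

In a sketch of this kind, the dictionary is deliberately wider than the activation space (`d_dict > d_model`), so that each learned feature can specialise on a narrow concept while only a few features are active for any given input.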
