Atlas-Alignment: Making Interpretability Transferable Across Language Models

Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek

TL;DR

Atlas-Alignment proposes a scalable, cross-model interpretability framework that transfers the semantics of a human-labeled Concept Atlas to unknown subject models by aligning latent spaces with lightweight transformations. The approach rests on the Linear and Platonic Representation Hypotheses and enables semantic querying, retrieval, and steering without training model-specific probes or collecting labeled concept data. Empirical results show robust semantic translation, accurate feature retrieval, and controllable generation across multiple Llama models, with Orthogonal Procrustes consistently outperforming other alignment methods. The approach amortizes interpretability costs: a single high-quality atlas can be reused to interpret and steer diverse models, and may extend to attention heads and other latent spaces. The work outlines practical paths toward scalable, model-agnostic interpretability and discusses its limitations and avenues for extension.

Abstract

Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
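As an illustration of the alignment step, the following is a minimal sketch of Orthogonal Procrustes, the method the TL;DR names as the strongest performer. It is not the authors' implementation. It assumes, for simplicity, that the subject and atlas activations were collected on the same inputs and have the same dimensionality; the function name `orthogonal_procrustes_map` and the synthetic data are hypothetical.

```python
import numpy as np


def orthogonal_procrustes_map(subject_acts, atlas_acts):
    """Return the orthogonal matrix R minimising ||subject_acts @ R - atlas_acts||_F.

    Both inputs are (num_inputs, dim) arrays computed on the same shared inputs.
    """
    # Cross-covariance between the two activation sets, shape (dim, dim).
    cross = subject_acts.T @ atlas_acts
    # The closed-form solution is U @ V^T from the SVD of the cross-covariance.
    u, _, vt = np.linalg.svd(cross)
    return u @ vt


# Synthetic stand-in data (assumption): the "subject" space is the "atlas"
# space under a hidden rotation plus a little noise.
rng = np.random.default_rng(0)
n, d = 500, 16
atlas_acts = rng.normal(size=(n, d))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
subject_acts = atlas_acts @ hidden_rotation.T + 0.01 * rng.normal(size=(n, d))

R = orthogonal_procrustes_map(subject_acts, atlas_acts)
aligned = subject_acts @ R
rel_err = np.linalg.norm(aligned - atlas_acts) / np.linalg.norm(atlas_acts)
print(f"relative alignment error: {rel_err:.4f}")
```

Because R is constrained to be orthogonal, the map preserves norms and angles in the subject space, and its inverse is simply its transpose, which is convenient when atlas directions are later mapped back into the subject model.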

Paper Structure

This paper contains 31 sections, 7 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Atlas-Alignment makes the latent space of a subject model interpretable by aligning it with a Concept Atlas, a human-interpretable, labeled latent space. Left: The subject model’s hidden representations are mapped into the Concept Atlas, allowing each subject feature to be described as a linear combination of atlas concepts. Right: Once aligned, the method enables a range of interpretability tasks. (A) One or multiple concepts are selected from the Atlas, (B) corresponding subject model components are identified, or (C) the subject model’s output is steered along the concept direction.
  • Figure 2: Examples of using Atlas-Alignment for identification and steering. (A) A Concept Query is constructed from multiple Concept Atlas features related to the theme of "secrets and deception" and mapped into the latent spaces of two subject models. (B) In Llama-Base, the alignment reveals two features in layer 12 that encode relevant concepts. (C) In Llama-IT, the same Concept Query is used to steer generation across multiple layers, shifting outputs toward concept-related text.
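The query, identification, and steering steps in the figure captions can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: it assumes an orthogonal alignment matrix `R` that maps subject activations to atlas coordinates (`h_subject @ R = h_atlas`), treats steering as adding a scaled direction to a single hidden vector, and uses hypothetical helper names (`build_query`, `retrieve_features`, `steer`).

```python
import numpy as np


def build_query(concept_ids, num_concepts):
    """(A) Combine several atlas concepts into one unit-norm query vector."""
    query = np.zeros(num_concepts)
    query[list(concept_ids)] = 1.0
    return query / np.linalg.norm(query)


def retrieve_features(R, query, top_k=2):
    """(B) Rank subject units by cosine similarity to the query.

    Row i of R is subject unit i expressed in atlas coordinates.
    """
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(query)
    scores = (R @ query) / np.clip(norms, 1e-12, None)
    return np.argsort(-scores)[:top_k], scores


def steer(hidden, R, query, alpha=4.0):
    """(C) Shift a subject hidden vector along the query direction.

    For orthogonal R the atlas direction maps back to subject space as R @ query.
    """
    direction = R @ query
    direction = direction / np.linalg.norm(direction)
    return hidden + alpha * direction


# Toy alignment (assumption): a permutation matrix, which is orthogonal.
R = np.eye(4)[[2, 0, 3, 1]]
query = build_query({0, 3}, num_concepts=4)

top, scores = retrieve_features(R, query, top_k=2)
steered = steer(np.zeros(4), R, query, alpha=2.0)
print("top subject units:", sorted(top.tolist()))
print("steered vector in atlas coordinates:", steered @ R)
```

In this toy setup, subject units 1 and 2 are the ones that line up with atlas concepts 0 and 3, and the steered vector, viewed in atlas coordinates, moves only along those two concepts.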