Atlas-Alignment: Making Interpretability Transferable Across Language Models

Bruno Puri; Jim Berend; Sebastian Lapuschkin; Wojciech Samek

TL;DR

Atlas-Alignment is a scalable, cross-model interpretability framework that transfers the semantics of a human-labeled Concept Atlas to unknown subject models by aligning their latent spaces with lightweight transformations. The approach rests on the Linear and Platonic Representation Hypotheses and supports semantic querying, retrieval, and steering without training model-specific probes or collecting labeled concept data. Empirical results show robust semantic translation, accurate feature retrieval, and controllable generation across multiple Llama models, with Orthogonal Procrustes consistently outperforming other alignment methods. The approach amortizes interpretability costs: a single high-quality atlas can be used to interpret and steer many different models, and the method could potentially extend to attention heads and other latent spaces. The work points to practical paths toward scalable, model-agnostic interpretability, while noting limitations and avenues for extension.
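To make the alignment step concrete, the following is a minimal sketch of Orthogonal Procrustes in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `orthogonal_procrustes`, the synthetic activations, and the assumption that the subject and atlas spaces share the same dimensionality are all hypothetical choices made for this example.

```python
import numpy as np


def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix R that minimizes ||X @ R - Y||_F.

    X: (n, d) subject-model activations on n shared inputs (assumed layout).
    Y: (n, d) atlas activations on the same n inputs (assumed layout).
    """
    # The minimizer is U @ Vt, where U, S, Vt is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check (hypothetical data): the "atlas" is an exact rotation
# of the "subject" activations, so the rotation should be recovered.
rng = np.random.default_rng(0)
n, d = 200, 16
subject = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
atlas = subject @ Q

R = orthogonal_procrustes(subject, atlas)
aligned = subject @ R  # subject activations expressed in atlas coordinates
```

Because `R` is constrained to be orthogonal, it preserves norms and angles, so cosine similarities computed in the subject space carry over to the aligned space. Real activations would not be related by an exact rotation, so the fit would be approximate.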

Abstract

Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain expensive and difficult to scale. Interpreting a new model typically requires training model-specific sparse autoencoders (SAEs), manually or semi-automatically labeling the SAE components, and then validating those labels. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas (a labeled, human-interpretable latent space) using only shared inputs and lightweight representational alignment techniques. Once a model is aligned, two key capabilities become available in what was previously an opaque model: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
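The two capabilities named in the abstract can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the alignment map `R` is taken to be orthogonal, atlas concepts are taken to be labeled direction vectors, retrieval is done by cosine similarity, and steering adds a scaled concept direction to a hidden state. The helper names `nearest_concepts` and `steer`, the labels, and the toy data are hypothetical, and the paper's actual procedure may differ.

```python
import numpy as np


def nearest_concepts(h_subject, R, concept_dirs, labels, k=1):
    """Map a subject activation into atlas space and rank atlas concepts."""
    z = h_subject @ R
    z = z / np.linalg.norm(z)
    C = concept_dirs / np.linalg.norm(concept_dirs, axis=1, keepdims=True)
    scores = C @ z  # cosine similarity to each labeled concept direction
    top = np.argsort(-scores)[:k]
    return [(labels[i], float(scores[i])) for i in top]


def steer(h_subject, R, concept_dir, alpha=4.0):
    """Push a subject activation along an atlas concept direction."""
    # For an orthogonal R, the inverse map back to subject space is R^T.
    direction = concept_dir @ R.T
    return h_subject + alpha * direction / np.linalg.norm(direction)


# Toy setup (hypothetical): 8-dimensional spaces, one concept per atlas axis.
rng = np.random.default_rng(0)
dim = 8
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # stand-in alignment map
concepts = np.eye(dim)
labels = [f"concept_{i}" for i in range(dim)]

h = concepts[2] @ R.T                    # subject activation encoding concept_2
before = nearest_concepts(h, R, concepts, labels)
h_steered = steer(h, R, concepts[5])     # steer toward concept_5
after = nearest_concepts(h_steered, R, concepts, labels)
```

In this toy setup, `before` ranks `concept_2` first and `after` ranks `concept_5` first, since the steering strength `alpha` is larger than the norm of the original activation. In a real model the steered hidden state would be written back into the forward pass at the chosen layer, which this sketch does not cover.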
