Challenges in Mechanistically Interpreting Model Representations

Satvik Golechha, James Dao

TL;DR

This paper argues that mechanistic interpretability should take internal representations, rather than token-aligned behaviors, as its unit of analysis in order to understand and control model behavior, especially for safety-critical capabilities. It formalizes input features and output behaviors as representational directions and evaluates them in a case study of dishonesty in Mistral-7B-Instruct-v0.1, using linear representations and existing MI tools. The findings suggest that current MI methods struggle to explain how representations form or how they drive long-horizon generation: dishonesty directions are distributed across many components and require continual injection, and patching reveals dense, distributed circuits. The work underscores the need for new framework-level approaches to studying representations and highlights implications for AI safety, alignment, and governance.

Abstract

Mechanistic interpretability (MI) aims to understand AI models by reverse-engineering the exact algorithms neural networks learn. Most MI work so far has studied behaviors and capabilities that are trivial and token-aligned. However, most capabilities important for safety and trust are not that trivial, which motivates studying the hidden representations inside these networks as the unit of analysis. We formalize representations for features and behaviors, highlight their importance and evaluation, and perform an exploratory study of dishonesty representations in Mistral-7B-Instruct-v0.1. We argue that the study of representations is important and under-explored, and we highlight several challenges that arise when attempting it with currently established MI methods, showing their insufficiency and advocating work on new frameworks.

Paper Structure

This paper contains 29 sections, 4 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Hidden representations inside models have meaningful geometric and semantic interpretations. Left: part segmentation in DINOv2 (Oquab et al., 2023). Middle: algebraic semantics in word vectors (Mikolov et al., 2013). Right: local coordinates in StyleGAN3 (Karras et al., 2021). Figures are adapted from these works, following a similar illustration in Zou et al. (2023).
  • Figure 2: Cosine similarities between the dishonesty directions of each pair of layers. Note that nearby layers have similar directions.
  • Figure 3: Data splitting: truthfulness directions from Marks and Tegmark (2023) separate the datapoints with $95\%$ accuracy.
  • Figure 4: Difference in log-probabilities of the dishonest token with and without dishonesty injection. Tokens shown in red have a large difference; only a fraction of the token positions require an injection for dishonest generation.
  • Figure 5: Direct logit attribution on one datapoint. Note how the contribution of each component changes as $\alpha$ varies, with a significantly larger contribution coming from the MLP layers.
  • ...and 7 more figures
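
The layer-wise comparison described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the vector size, the number of layers, and the random-walk toy data are all assumptions introduced here, and a real analysis would replace `directions` with one direction vector per layer extracted from the model.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(directions):
    """Pairwise cosine similarities between per-layer direction vectors."""
    return [[cosine_similarity(u, v) for v in directions] for u in directions]


# Hypothetical stand-in data (NOT taken from the paper): each layer's
# direction is the previous layer's direction plus a small random
# perturbation, so the vectors drift gradually from layer to layer.
rng = random.Random(0)
hidden_size, num_layers = 64, 8  # assumed toy sizes
directions = [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)]]
for _ in range(num_layers - 1):
    previous = directions[-1]
    directions.append([x + 0.3 * rng.gauss(0.0, 1.0) for x in previous])

sims = similarity_matrix(directions)

# Print the matrix; entry (i, j) compares the directions of layers i and j.
for row in sims:
    print(" ".join(f"{s:5.2f}" for s in row))
```

The toy data is built so that adjacent layers tend to be more similar than distant ones, which is the qualitative pattern the caption describes. It says nothing about the actual directions in Mistral-7B-Instruct-v0.1.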