RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Jing Huang; Zhengxuan Wu; Christopher Potts; Mor Geva; Atticus Geiger

TL;DR

RAVEL introduces a controlled benchmark for evaluating interpretability methods on their ability to disentangle attribute representations in language models. It defines an interchange-intervention framework and a set of evaluation metrics to quantify causal disentanglement, then systematically compares multiple methods, including PCA, sparse autoencoders, RLAP, DBM, DAS, and multitask variants. The study finds that counterfactual-supervised methods, particularly the proposed Multi-task DAS (MDAS), achieve the strongest disentanglement with relatively small intervention dimensions, while some attribute pairs remain intrinsically entangled. Results indicate that disentanglement improves across model layers, underscoring the value of distributed representations over neuron-local explanations. RAVEL and MDAS together offer a scalable, generalizable approach for assessing and advancing interpretability in large language models, with the dataset and code released for community use.
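The interchange intervention mentioned above can be illustrated with a minimal sketch. It assumes a DAS-style setup in which features are the coordinates of a hidden state under an orthogonal rotation, and the first `k` coordinates stand in for the feature of one attribute. The function name, the random rotation, and the toy vectors are all hypothetical and are not taken from the RAVEL codebase; in practice the rotation would be learned and the hidden states would come from the language model.

```python
import numpy as np


def interchange_intervention(base_h, source_h, rotation, k):
    """Return base_h with its first k rotated coordinates taken from source_h.

    rotation is assumed to be an orthogonal matrix, so rotation.T maps a
    hidden state to feature coordinates and rotation maps them back.
    """
    base_f = rotation.T @ base_h        # feature coordinates of the base run
    source_f = rotation.T @ source_h    # feature coordinates of the source run
    mixed = base_f.copy()
    mixed[:k] = source_f[:k]            # overwrite only the chosen feature subspace
    return rotation @ mixed             # map back to the original activation space


# Toy stand-ins: a random orthogonal matrix and random "hidden states".
rng = np.random.default_rng(0)
d, k = 8, 2
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
base_h = rng.normal(size=d)
source_h = rng.normal(size=d)

patched_h = interchange_intervention(base_h, source_h, rotation, k)
```

In a real experiment, `patched_h` would replace the base prompt's hidden state at the chosen layer and token position, and the model's output would then be compared against the attribute value of the source entity.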

Abstract

Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.

Paper Structure (46 sections, 18 equations, 4 figures, 8 tables)


Introduction
The RAVEL Dataset
The Attribute Disentanglement Task
Data Generation
Selecting Entity Types and Attributes
Constructing Prompts
Generating Splits
Filtering for a Specific Model
Interpretability Evaluation
Interchange Interventions
Evaluation Data
Metrics
Interpretability Methods
PCA
Sparse Autoencoder
...and 31 more sections

Figures (4)

Figure 1: An overview of the RAVEL benchmark, which evaluates how well an interpretability method can find features that isolate the causal effect of individual attributes of an entity.
Figure 2: $\texttt{Cause}$ and $\texttt{Iso}$ scores for each method when using different feature sizes, shown as the ratio (%) between the dimension of $F_{A}$ and the dimension of the output space of $\mathcal{F}$. Each method has three data points that vary from using very few ($\approx$1%) to half ($\approx$50%) of the dimensions. Increasing the feature dimension generally leads to a higher $\texttt{Cause}$ score but a lower $\texttt{Iso}$ score. Figure best viewed in color.
Figure 3: Additional results for the MDAS method.
Figure 4: Additional feature disentanglement results for RLAP, DBM, and MDBM methods.
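
The $\texttt{Cause}$ and $\texttt{Iso}$ scores plotted in Figure 2 can be read as two accuracy-style proportions. The sketch below assumes one plausible reading, which this page does not spell out: $\texttt{Cause}$ is the fraction of interventions on the feature for attribute A after which the model outputs the source entity's value for A, and $\texttt{Iso}$ is the fraction of interventions on that same feature after which the model still outputs the base entity's value for a different attribute B. Combining the two by a plain average is likewise an assumption, and all function names and toy values are hypothetical.

```python
def cause_score(predictions, source_values):
    """Fraction of interventions whose output matches the source entity's value."""
    if not predictions:
        raise ValueError("need at least one intervention result")
    hits = sum(p == s for p, s in zip(predictions, source_values))
    return hits / len(predictions)


def iso_score(predictions, base_values):
    """Fraction of interventions whose output still matches the base entity's value."""
    if not predictions:
        raise ValueError("need at least one intervention result")
    hits = sum(p == b for p, b in zip(predictions, base_values))
    return hits / len(predictions)


def disentangle_score(cause, iso):
    """Assumed combination: an unweighted mean of the two scores."""
    return (cause + iso) / 2


# Toy outputs after intervening on a hypothetical "continent" feature.
# Prompts that query the continent should now give the source entity's continent.
cause = cause_score(
    ["Asia", "Europe", "Asia", "Africa"],
    ["Asia", "Europe", "Europe", "Africa"],
)
# Prompts that query the language should be unaffected by the same intervention.
iso = iso_score(
    ["English", "French"],
    ["English", "Spanish"],
)
overall = disentangle_score(cause, iso)
```

Under this reading, the trade-off in Figure 2 is expected: a larger feature subspace carries more of the source entity's information, which helps `cause_score`, but it also overwrites more of what the model needs for other attributes, which hurts `iso_score`.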
