RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
TL;DR
RAVEL is a controlled benchmark for evaluating how well interpretability methods disentangle attribute representations in language models. It defines an interchange-intervention framework and a set of evaluation metrics that quantify causal disentanglement, then systematically compares several methods: PCA, sparse autoencoders, RLAP, DBM, DAS (Distributed Alignment Search), and multi-task variants. The study finds that counterfactual-supervised methods, particularly the proposed Multi-task DAS (MDAS), achieve the strongest disentanglement with relatively small intervention dimensions, while some attribute pairs remain intrinsically entangled. Results also indicate that disentanglement improves across model layers, and they underscore the value of distributed representations over neuron-local explanations. Together, RAVEL and MDAS offer a scalable, generalizable approach to assessing and advancing interpretability in large language models; the dataset and code are released for community use.
Abstract
Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.
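To make the idea of intervening on a distributed feature concrete, the following is a minimal sketch of an interchange intervention in a rotated subspace, in the general style of DAS. It is an illustration under stated assumptions, not the RAVEL or MDAS implementation: the function name `interchange_intervention`, the toy dimensions, and the random orthogonal matrix (a stand-in for a learned rotation) are all hypothetical.

```python
import numpy as np


def interchange_intervention(h_base, h_source, rotation, k):
    """Replace the first k rotated coordinates of h_base with those of h_source.

    rotation is assumed to be an orthogonal (d x d) matrix whose first k rows
    span the feature subspace; this layout is an illustrative assumption.
    """
    z_base = rotation @ h_base      # base activation in the rotated basis
    z_source = rotation @ h_source  # source activation in the rotated basis
    z_base[:k] = z_source[:k]       # swap only the feature coordinates
    return rotation.T @ z_base      # map back to the original activation space


# Toy setup: a random orthogonal matrix stands in for a learned rotation.
rng = np.random.default_rng(0)
d, k = 8, 2
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
h_base = rng.normal(size=d)
h_source = rng.normal(size=d)

h_new = interchange_intervention(h_base, h_source, rotation, k)
```

In this sketch, the intervened activation carries the source's values inside the k-dimensional subspace and the base's values everywhere else. A feature that is well disentangled in the sense described above would, under such a swap, change the model's prediction for the targeted attribute while leaving other attributes unaffected.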
