Compressed Sensing for Capability Localization in Large Language Models

Anna Bair, Yixuan Even Xu, Mingjie Sun, J. Zico Kolter

TL;DR

This work introduces a compressed sensing based method that exploits the sparsity of task-specific attention heads to identify them via strategic knockouts and a small number of model evaluations, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components.

Abstract

Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task-specific heads can degrade performance by up to $65\%$ on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at https://github.com/locuslab/llm-components.
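The abstract does not spell out the recovery procedure, so the following is only a minimal sketch of how sparse head identification from random knockouts could work in principle. It assumes a toy additive model in which each knocked-out head lowers a benchmark score by a fixed amount, simulates the "model evaluations" instead of running an LLM, and uses plain iterative soft thresholding (ISTA) for the Lasso as a generic sparse-recovery solver. The function names, the 50% knockout probability, and all numeric settings are hypothetical and are not taken from the paper or its released code.

```python
import numpy as np


def simulate_knockout_scores(masks, head_effects, baseline, noise_std, rng):
    """Toy stand-in for a benchmark run (assumption: effects add linearly).

    masks[i, j] == 1 means head j is knocked out in evaluation i.
    Scores are not clipped to [0, 1]; this is a toy, not a real benchmark.
    """
    noise = rng.normal(0.0, noise_std, size=masks.shape[0])
    return baseline - masks @ head_effects + noise


def ista_lasso(X, y, lam, n_iters=2000):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 by soft thresholding."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = w - step * (X.T @ (X @ w - y))      # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrink
    return w


def locate_heads(n_heads=256, n_evals=100, k=5, seed=0):
    """Recover k important heads from n_evals < n_heads random knockouts."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: only k heads matter for the task.
    true_heads = rng.choice(n_heads, size=k, replace=False)
    head_effects = np.zeros(n_heads)
    head_effects[true_heads] = rng.uniform(0.08, 0.15, size=k)

    # Each evaluation knocks out a random subset of heads.
    masks = (rng.random((n_evals, n_heads)) < 0.5).astype(float)
    scores = simulate_knockout_scores(masks, head_effects, 0.9, 0.005, rng)

    # Center masks and scores so the unknown baseline drops out.
    X = masks - masks.mean(axis=0)
    y = -(scores - scores.mean())  # positive values = larger score drop

    lam = 0.1 * np.max(np.abs(X.T @ y))
    w_hat = ista_lasso(X, y, lam)

    found = np.argsort(-np.abs(w_hat))[:k]
    return set(true_heads.tolist()), set(found.tolist())


if __name__ == "__main__":
    true_heads, found_heads = locate_heads()
    print("true heads: ", sorted(true_heads))
    print("found heads:", sorted(found_heads))
```

In this toy setting, 100 simulated evaluations are used to search over 256 heads, which is the sense in which sparsity allows fewer measurements than one knockout per head. Real models need not behave additively, so the sketch illustrates the idea rather than the authors' procedure.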

Paper Structure

This paper contains 37 sections, 1 equation, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Knocking out the top five math heads identified by our compressed sensing method greatly reduces performance on math datasets (GSM8K and Arithmetic) while leaving other tasks relatively unaffected.
  • Figure 2: Ablating increasing numbers of top math heads degrades GSM8K and Arithmetic performance while leaving general language abilities (average of HellaSwag, BoolQ, Arc Challenge, and MMLU) unaffected. We begin to see diminishing returns on task-specific degradation as we continue ablating additional heads.
  • Figure 3: Knowledge-based multiple choice heads identified in Llama 3.2 3B and Llama 3.2 1B affect all WMDP subtypes as well as MMLU. Ablating just one of these heads reduces performance on WMDP and MMLU to random chance (approximately $25\%$) while leaving general performance (average across HellaSwag, BoolQ, and Arc Challenge) unaffected.
  • Figure 4: WMDP subtasks have very weak localization in Llama 3.1 8B.
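
The head ablations referred to in the captions above can be pictured with a toy multi-head attention layer in which a per-head mask zeroes a head's contribution. This is a hedged illustration only: the NumPy layer, the `head_mask` argument, and the choice to zero each head's output before it is added to the layer output are assumptions made for the example, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask):
    """Toy causal self-attention with per-head knockout.

    x:          (seq_len, d_model) input activations
    Wq, Wk, Wv: (n_heads, d_model, d_head) per-head projections
    Wo:         (n_heads, d_head, d_model) per-head output projections
    head_mask:  (n_heads,) with 1.0 = keep the head, 0.0 = zero it out
    """
    seq_len = x.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    out = np.zeros_like(x)
    for h in range(Wq.shape[0]):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = (q @ k.T) / np.sqrt(q.shape[-1])
        scores = np.where(causal, scores, -np.inf)  # block future positions
        head_out = softmax(scores) @ v
        # A knocked-out head contributes nothing to the layer output.
        out += head_mask[h] * (head_out @ Wo[h])
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_heads, d_model, d_head, seq_len = 4, 8, 2, 5
    x = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
    Wo = rng.normal(size=(n_heads, d_head, d_model))

    full = multi_head_attention(x, Wq, Wk, Wv, Wo, np.ones(n_heads))
    ablated = multi_head_attention(x, Wq, Wk, Wv, Wo, np.array([1.0, 1.0, 0.0, 1.0]))
    print("max change from ablating head 2:", np.abs(full - ablated).max())
```

Because the layer output is a sum over heads, masking a head removes exactly that head's term. An ablation sweep like the one in Figure 2 would repeat this with progressively more heads masked and re-evaluate the benchmark each time.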