Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Alexandre Variengien, Eric Winsor

TL;DR

We introduce ORION, a structured retrieval-task suite that enables scalable causal analysis of language-model internals across diverse domains and architectures. Through residual-stream patching, we reveal a universal macroscopic modular decomposition in which mid-layer processing encodes the user request and late layers perform context retrieval, a pattern observed across 18 open-source models and 6 task types. A fine-grained case study on Pythia-2.8b connects this macroscopic picture to a microscopic description, and we demonstrate a proof of concept for scalable internal oversight that mitigates prompt injection while requiring human supervision on only a single input. These findings support a top-down, data-centric interpretability framework that can guide robust LM oversight and safer deployment of general-purpose systems.

Abstract

When solving challenging problems, language models (LMs) are able to identify relevant information from long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g., a question) that retrieves an attribute (e.g., the character name) from a context (e.g., a story). We apply causal analysis to 18 open-source language models with sizes ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct token probability in 98 of the 106 studied model-task pairs. We connect our macroscopic decomposition with a microscopic description by performing a fine-grained case study of a question-answering task on Pythia-2.8b. Building on our high-level understanding, we demonstrate a proof-of-concept application for scalable internal oversight of LMs to mitigate prompt injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.
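The decomposition described in the abstract can be caricatured in a few lines of Python. This is only a toy sketch of the bookkeeping behind request-patching, not the authors' code and not a transformer: the two-stage split is hard-coded here, whereas the paper reports it as an emergent property found by intervening on real models. The names `mid_layers`, `late_layers`, `run`, and the dictionary-based "residual stream" are all hypothetical.

```python
from typing import Dict, Optional

# Toy stand-in for a language model. NOT a transformer: the modular split is
# built in by hand purely to illustrate what patching swaps and what it keeps.

def mid_layers(inp: Dict) -> Dict[str, str]:
    """Hypothetical 'middle layers': encode which attribute the request asks for."""
    return {"requested_attribute": inp["request"]}

def late_layers(resid: Dict[str, str], inp: Dict) -> str:
    """Hypothetical 'late layers': look up the requested attribute in the context."""
    return inp["context"][resid["requested_attribute"]]

def run(inp: Dict, patched_resid: Optional[Dict[str, str]] = None) -> str:
    """Run the toy model, optionally overwriting the mid-layer activation."""
    resid = mid_layers(inp) if patched_resid is None else patched_resid
    return late_layers(resid, inp)

# x1 asks for the city; x2 is a story about Bob in Paris and asks for the character.
x1 = {"context": {"character": "Alice", "city": "Cusco"}, "request": "city"}
x2 = {"context": {"character": "Bob", "city": "Paris"}, "request": "character"}

clean_output = run(x2)                                   # request and context both from x2
patched_output = run(x2, patched_resid=mid_layers(x1))   # request from x1, context from x2

print(clean_output)    # Bob
print(patched_output)  # Paris
```

The patched run answers the request of `x1` using the context of `x2`, which is the modular recombination that the abstract attributes to mid-layer patching at the last token position.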

Paper Structure (65 sections, 10 equations, 29 figures, 5 tables)

This paper contains 65 sections, 10 equations, 29 figures, 5 tables.

Figures (29)

  • Figure 1: Illustration of our main experimental discovery. Patching the mid-layer residual stream on a retrieval task from ORION causes the language model to output a modular combination of the request from $x_1$ (asking for the city) and the context from $x_2$ (a story about Bob in Paris). We call this phenomenon request-patching.
  • Figure 2: Illustration of our two central methods: precise dataset design and causal intervention. Left: Informal illustration of different types of datasets for investigating the abilities of language models. Benchmarks such as MMLU aim at faithfully mapping the frontier of language models' abilities through behavioral analysis. However, because of experimental constraints, interpretability case studies are often focused on a narrow task. By introducing ORION we propose a middle-ground: it is a collection of narrow tasks spanning different domains while being connected by a common structure, enabling the design of causal experiments at scale. Right: An interchange intervention (here $\mathcal{M} (x|z_1^1\leftarrow z_1^1(x'))$) is a causal intervention where an activation (here the residual stream $z_1^1$) is replaced by its value on a source input ($x'=$"When Alice lived") while the rest of the model is run on the target input ($x=$"Then Bob went"). The activations downstream of the intervened node depend on the two inputs (in yellow).
  • Figure 3: Example input from ORION for the question-answering task. Textual inputs are annotated with an abstract representation of the context and the request. Abstract context representations are tables where each line lists attributes relative to a story element. Requests can be formulated using simple SQL-like queries.
  • Figure 4: The semi-automatic task generation process used to create ORION. We use ChatGPT to create a template and values for the placeholders given a problem type. To generate an instance from the task, we start by randomly selecting placeholder values to create an abstract input representation. Then, we use a format string to fill the template. When we need more flexibility, we use GPT-4 to incorporate the placeholder values into the template.
  • Figure 5: Accuracy of 18 models on the ORION task collection. Models with more than 7 billion parameters are able to robustly solve every task. However, simple tasks such as the base question-answering can be solved by models as small as GPT-2 small (125 million parameters), enabling comparative studies across a wide range of model scales.
  • ...and 24 more figures
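
The abstract input representation and template filling described in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not ORION's generation code: the template text, placeholder values, question wording, and the helpers `retrieve` and `sample_instance` are hypothetical, and the ChatGPT and GPT-4 steps mentioned in Figure 4 are not reproduced.

```python
import random
from typing import Dict, List

# Hypothetical template and placeholder values (in the paper these are produced
# with ChatGPT; here they are written by hand for illustration).
TEMPLATE = "Story: {character} spent the summer in {city}.\nQuestion: {question}\nAnswer:"
PLACEHOLDERS = {"character": ["Alice", "Bob"], "city": ["Paris", "Cusco"]}
QUESTIONS = {
    "character": "What is the name of the main character?",
    "city": "In which city does the story take place?",
}

def retrieve(context: List[Dict[str, str]], request: Dict) -> str:
    """Evaluate a SQL-like request (select an attribute where a filter holds)."""
    for row in context:
        if all(row.get(key) == value for key, value in request["where"].items()):
            return row[request["select"]]
    raise LookupError("no row of the context matches the request")

def sample_instance(rng: random.Random) -> Dict:
    """Sample placeholder values, build the abstract representation, fill the template."""
    values = {slot: rng.choice(options) for slot, options in PLACEHOLDERS.items()}
    # Abstract context: a table with one row per story element.
    context = [{"role": role, "name": name} for role, name in values.items()]
    asked = rng.choice(sorted(QUESTIONS))
    # Abstract request, roughly: SELECT name FROM context WHERE role = asked
    request = {"select": "name", "where": {"role": asked}}
    prompt = TEMPLATE.format(question=QUESTIONS[asked], **values)
    return {"prompt": prompt, "context": context, "request": request,
            "answer": retrieve(context, request)}

instance = sample_instance(random.Random(0))
print(instance["prompt"])
print("expected answer:", instance["answer"])
```

Keeping the abstract context and request alongside the text prompt is what makes it possible to say, for any pair of inputs, which answer corresponds to mixing the request of one input with the context of another.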