Does Editing Provide Evidence for Localization?

Zihao Wang, Victor Veitch

TL;DR

The paper investigates whether edit-based interventions provide credible evidence that a small set of model components localizes a target behavior in LLMs. It develops an optimal-intervention framework (IPO) that aligns model edits with behavior and compares it to ITI heuristics, using TruthfulQA on an Alpaca-7B model. Surprisingly, both localized and randomly chosen heads can achieve near-optimal truthfulness–informativeness tradeoffs when edits are optimized, challenging the claim that observed edit effects pinpoint a true localization. The authors argue for precise, falsifiable definitions of localization and rigorous evaluation standards to avoid mistaking manipulation artifacts for genuine causal localization.

Abstract

A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To evaluate the localization claim, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.
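To make the notion of a localized edit concrete, the following is a minimal sketch of editing the representations at a chosen set of attention heads while leaving all other heads untouched. It is an illustration under stated assumptions, not the paper's implementation: the function name `edit_heads`, the array shapes, the head indices, and the placeholder shift values are all hypothetical. In the setting described above, the shifts would either come from a heuristic or be optimized with an alignment objective.

```python
import numpy as np

def edit_heads(head_outputs, head_ids, shifts):
    """Add a shift vector to the outputs of the selected heads only.

    head_outputs: array of shape (n_heads, d_head), one row per attention head.
    head_ids: unique indices of the heads forming the candidate localization.
    shifts: array of shape (len(head_ids), d_head), one shift per selected head.
    """
    edited = head_outputs.copy()   # leave the original activations untouched
    edited[head_ids] += shifts     # edit only the selected heads
    return edited

rng = np.random.default_rng(0)
n_heads, d_head, k = 32, 8, 4      # hypothetical sizes
head_outputs = rng.normal(size=(n_heads, d_head))

# Hypothetical candidate localization (e.g. from a probing heuristic)
# versus a random baseline with the same number of heads.
localized_ids = np.array([3, 7, 11, 20])
random_ids = rng.choice(n_heads, size=k, replace=False)

# Placeholder shifts; these stand in for heuristic or optimized edits.
shifts = 0.5 * np.ones((k, d_head))

edited_localized = edit_heads(head_outputs, localized_ids, shifts)
edited_random = edit_heads(head_outputs, random_ids, shifts)
```

Comparing downstream behavior under `edited_localized` and `edited_random` is the kind of localized-versus-random comparison the abstract refers to. The sketch only shows where the edit is applied, not how its effect on model behavior is measured.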

Paper Structure

This paper contains 10 sections, 5 equations, 5 figures.

Figures (5)

  • Figure 1: Localized heads perform much better than random when using ITI interventions. We observe better Info*Truth scores, better truth-info score tradeoff, as well as better MC-KL tradeoff.
  • Figure 2: IPO interventions achieve much better performance than ITI interventions. IPO interventions at localized heads also give a nearly optimal info-truth tradeoff.
  • Figure 3: Using IPO optimal localized interventions, randomly selected heads perform nearly optimally for steering model generations. In particular, random heads are as good as the conjectured localized heads. The random heads are the same as those in Figure 1.
  • Figure 4: Using a single head is as effective, and there are multiple such heads!
  • Figure 5: Probing-localized heads seem somewhat special in MC scores.