Optimal ablation for interpretability

Maximilian Li, Lucas Janson

TL;DR

We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over importance measured with other ablation methods, and that it can benefit several downstream interpretability tasks.

Abstract

Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.

Paper Structure (67 sections, 1 theorem, 43 equations, 18 figures, 3 tables, 1 algorithm)

Key Result

Proposition 2.3

Let $\Delta(\mathcal{M},\mathcal{A})$ be the ablation loss gap for some component $\mathcal{A}$ measured with any total ablation method. Then, $\Delta_\mathrm{opt}(\mathcal{M},\mathcal{A})\leq \Delta(\mathcal{M},\mathcal{A})$.
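The inequality can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes that a total ablation method replaces the component's output with a value drawn independently of the input, and that optimal ablation uses the single constant that minimizes expected loss. The model, the `component` and `rest` functions, and the squared-error loss are all hypothetical choices made so that the optimal constant has a closed form.

```python
import math
import random

random.seed(0)

# Hypothetical toy setup (not from the paper): scalar input x, one "component"
# a(x), and a model whose prediction is v * a(x) + rest(x).
n = 200
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
v = 1.5

def component(x):
    return math.tanh(2.0 * x) + 0.5

def rest(x):
    return 0.3 * x

# Targets: the clean model output plus a little noise.
ts = [v * component(x) + rest(x) + random.gauss(0.0, 0.1) for x in xs]

def ablated_loss(c):
    """Mean squared error when the component output is replaced by the constant c."""
    return sum((v * c + rest(x) - t) ** 2 for x, t in zip(xs, ts)) / n

clean_loss = sum((v * component(x) + rest(x) - t) ** 2 for x, t in zip(xs, ts)) / n

acts = [component(x) for x in xs]
mean_act = sum(acts) / n

# Assumed form of optimal ablation: the constant minimizing the ablated loss.
# For squared error this is the minimizer of a quadratic in c.
c_opt = sum(t - rest(x) for x, t in zip(xs, ts)) / (n * v)

# Resample ablation: replace a(x_i) by a(x_j) for an independently drawn j.
# Its expected loss is the average of ablated_loss over the empirical activations.
resample_loss = sum(ablated_loss(a) for a in acts) / n

# Ablation loss gaps: ablated loss minus clean loss, one per method.
gaps = {
    "zero": ablated_loss(0.0) - clean_loss,
    "mean": ablated_loss(mean_act) - clean_loss,
    "resample": resample_loss - clean_loss,
    "optimal": ablated_loss(c_opt) - clean_loss,
}

for name, gap in gaps.items():
    print(f"{name:>8}: {gap:.4f}")
```

In this sketch the optimal gap is the smallest of the four. Zero and mean ablation are particular constants, and resample ablation averages the ablated loss over a distribution of constants, so none of them can fall below the minimum over all constants.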

Figures (18)

  • Figure 1: Left: Circuit discovery Pareto frontier for the IOI subtask with counterfactual ablation. Right: Comparison of ablation methods for circuit discovery on IOI (X indicates manual circuit evaluated on each ablation method). $\Delta$ is measured in KL-divergence.
  • Figure 2: Comparison of AIE with GNT and OAT. In the top figure, layer $\ell$ on the x-axis represents replacing a sliding window of 5 layers with $\ell$ as the median. Error bars indicate the sample estimate plus/minus two standard errors (details given in Appendix \ref{appendix-tracing-ci}).
  • Figure 3: Left: Prediction loss comparison between tuned lens and ablation-based alternatives. Middle, right: Causal faithfulness metrics for tuned and OCA lens under basis-aligned projections.
  • Figure 4: Comparison of calibrated elicitation accuracy on selected datasets.
  • Figure 5: Correlation of single-component ablation loss measurements on IOI. Lower triangle shows rank correlation and upper triangle shows log-log correlation across metrics.
  • ...and 13 more figures

Theorems & Definitions (6)

  • Definition 2.1: Total ablation
  • Definition 2.2: Optimal ablation
  • Proposition 2.3
  • proof
  • Definition C.1: Vertex activation patching
  • Definition C.2: Edge activation patching