Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, Rajaie Batniji

Abstract

Language models encode task-relevant knowledge in their internal representations far more accurately than their outputs reflect, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods -- concept bottleneck steering (Steerling-8B), sparse autoencoder (SAE) feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector (TSV) steering (Qwen 2.5 7B Instruct) -- for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model's output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.
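
The two outcome measures quoted above, the share of missed hazards that an intervention corrects and the share of correct detections that it disrupts, can be illustrated with a minimal Python sketch. The function name `steering_rates`, the binary encoding (1 = hazard, or flagged as hazard), and the toy inputs are assumptions made for illustration; the exact scoring procedure used in the paper is not given in this excerpt.

```python
def steering_rates(labels, baseline, steered):
    """Summarise how an intervention changes triage outputs on hazardous cases.

    labels:   1 = physician-adjudicated hazard, 0 = benign (assumed encoding)
    baseline: 1 = model flagged a hazard before the intervention
    steered:  1 = model flagged a hazard after the intervention
    """
    # Hazardous cases the model missed at baseline (false negatives).
    fn = [i for i, (y, b) in enumerate(zip(labels, baseline)) if y == 1 and b == 0]
    # Hazardous cases the model caught at baseline (true positives).
    tp = [i for i, (y, b) in enumerate(zip(labels, baseline)) if y == 1 and b == 1]

    corrected = sum(1 for i in fn if steered[i] == 1)   # missed -> flagged
    disrupted = sum(1 for i in tp if steered[i] == 0)   # flagged -> missed

    n_hazard = len(fn) + len(tp)
    return {
        "baseline_sensitivity": len(tp) / n_hazard if n_hazard else float("nan"),
        "fn_correction_rate": corrected / len(fn) if fn else float("nan"),
        "tp_disruption_rate": disrupted / len(tp) if tp else float("nan"),
    }


# Toy data, invented for illustration only (not taken from the study).
toy_labels = [1, 1, 1, 1, 1, 0, 0]
toy_baseline = [0, 0, 0, 1, 1, 0, 1]
toy_steered = [1, 0, 0, 0, 1, 0, 0]
toy_rates = steering_rates(toy_labels, toy_baseline, toy_steered)
```

Reporting both rates together matters because an intervention that corrects some false negatives while disrupting a larger share of true positives can lower overall sensitivity.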

Paper Structure (52 sections, 15 equations, 5 figures, 9 tables)


Figures (5)

  • Figure 1: Dose-response relationship between concept steering strength and triage outcomes in Steerling-8B. False-negative correction rate (solid blue) and true-positive disruption rate (dashed red) as a function of concept activation alpha level for hazard-associated concept steering in the physician-created dataset (85 false negatives, 47 true positives at baseline). Error bars indicate 95% Wilson score confidence intervals. At every alpha level, the true-positive disruption rate exceeds the false-negative correction rate. The random-concept control (triangles) shows comparable correction and disruption rates, indicating that hazard-concept steering is no more effective than random perturbation.
  • Figure 2: Per-layer probe AUROC and TSV discrimination in Qwen 2.5 7B. Cross-validated AUROC (5-fold stratified) for L2-regularised logistic regression probes trained at each of the 28 layers of Qwen 2.5 7B to discriminate physician-adjudicated hazardous from benign cases ($n=400$). Probes achieve AUROC above 0.95 at all layers, peaking at 0.982 (95% CI 0.968 to 0.993) at layer 23 (arrow). The dashed horizontal line indicates the model's actual output sensitivity of 0.451. At layer 23, the truthfulness separator vector (TSV) discriminates true-positive from false-negative cases with AUROC 0.814 (95% CI 0.738 to 0.887). The difference between probe discrimination (what the model represents internally) and output sensitivity (what it actually reports) is the knowledge-action gap, which TSV steering only partially bridged.
  • Figure 3: Concept activation heatmap by hazard category. Mean concept activation (sigmoid space) for the top 10 concepts across 18 hazard categories and benign cases. Darker cells indicate higher mean activation. The sparsity of the heatmap reflects the overall sparsity of the concept activation space (99.92% of activations $< 0.01$).
  • Figure 4: False-negative correction and true-positive disruption rates for concept-level interventions. (A) False-negative correction rate for hazard-concept amplification ($\alpha = 1.0$), random-concept amplification ($\alpha = 1.0$), and prompt engineering. (B) True-positive disruption rate for concept suppression ($\alpha = 0.0$): hazard-concept suppression disrupted 53.2% of correct detections, comparable to random-concept suppression (61.7%). Error bars indicate 95% Wilson score CIs.
  • Figure 5: Triage sensitivity and specificity by demographic group (Steerling-8B, 200 physician-created vignettes with racial/ethnic descriptors). Sensitivity ranged from 0.371 (Hispanic) to 0.485 (Black); differences were not statistically significant ($\chi^2$ test, $p = 0.24$). Specificity was similar across groups (0.647 to 0.691).