Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

Mohammed Suhail B Nadaf

Abstract

Function vectors (FVs) -- mean-difference directions extracted from in-context learning demonstrations -- can steer large language model behavior when added to the residual stream. We hypothesized that FV steering failures reflect an absence of task-relevant information: the logit lens would fail alongside steering. We were wrong. In the most comprehensive cross-template FV transfer study to date -- 4,032 pairs across 12 tasks, 6 models from 3 families (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-v0.3; base and instruction-tuned), 8 templates per task -- we find the opposite dissociation: FV steering succeeds even when the logit lens cannot decode the correct answer at any layer. This steerability-without-decodability pattern is universal: steering exceeds logit lens accuracy for every task on every model, with gaps as large as -0.91. Only 3 of 72 task-model instances show the predicted decodable-without-steerable pattern, all in Mistral. FV vocabulary projection reveals that FVs achieving over 0.90 steering accuracy still project to incoherent token distributions, indicating FVs encode computational instructions rather than answer directions. FVs intervene optimally at early layers (L2-L8); the logit lens detects correct answers only at late layers (L28-L32). The previously reported negative cosine-transfer correlation (r=-0.572) dissolves at scale: pooled r ranges from -0.199 to +0.126, and cosine adds less than 0.011 in R-squared beyond task identity. Post-steering analysis reveals a model-family divergence: Mistral FVs rewrite intermediate representations; Llama/Gemma FVs produce near-zero changes despite successful steering. Activation patching confirms causal localization: easy tasks achieve perfect recovery at targeted layers; hard tasks show zero recovery everywhere.
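The FV construction described above (a mean difference of residual-stream activations, added back during inference) can be sketched in a few lines. This is a minimal illustration on synthetic activations, not the paper's implementation: the function names, dimensions, and random data are all hypothetical, and a real run would read activations from a hooked transformer layer.

```python
import numpy as np

def extract_function_vector(icl_states, baseline_states):
    """Mean-difference function vector (illustrative sketch).

    icl_states, baseline_states: arrays of shape (n_prompts, d_model)
    holding residual-stream activations at a chosen layer, collected
    with and without in-context demonstrations respectively.
    """
    return icl_states.mean(axis=0) - baseline_states.mean(axis=0)

def steer(residual, fv, alpha=1.0):
    """Steering intervention: add the scaled FV to a residual-stream state."""
    return residual + alpha * fv

# Synthetic stand-ins for layer activations (d_model = 16 is arbitrary).
rng = np.random.default_rng(0)
d_model = 16
icl = rng.normal(size=(8, d_model)) + 2.0   # demonstrations shift activations
base = rng.normal(size=(8, d_model))        # zero-shot activations

fv = extract_function_vector(icl, base)
steered = steer(base[0], fv)
```

In the paper's setting, the intervention point matters: FVs are most effective when added at early layers (L2-L8), well before the layers at which the logit lens can read out the answer.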

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

Abstract

Function vectors (FVs) -- mean-difference directions extracted from in-context learning demonstrations -- can steer large language model behavior when added to the residual stream. We hypothesized that FV steering failures reflect an absence of task-relevant information: the logit lens would fail alongside steering. We were wrong. In the most comprehensive cross-template FV transfer study to date - 4,032 pairs across 12 tasks, 6 models from 3 families (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-v0.3; base and instruction-tuned), 8 templates per task - we find the opposite dissociation: FV steering succeeds even when the logit lens cannot decode the correct answer at any layer. This steerability-without-decodability pattern is universal: steering exceeds logit lens accuracy for every task on every model, with gaps as large as -0.91. Only 3 of 72 task-model instances show the predicted decodable-without-steerable pattern, all in Mistral. FV vocabulary projection reveals that FVs achieving over 0.90 steering accuracy still project to incoherent token distributions, indicating FVs encode computational instructions rather than answer directions. FVs intervene optimally at early layers (L2-L8); the logit lens detects correct answers only at late layers (L28-L32). The previously reported negative cosine-transfer correlation (r=-0.572) dissolves at scale: pooled r ranges from -0.199 to +0.126, and cosine adds less than 0.011 in R-squared beyond task identity. Post-steering analysis reveals a model-family divergence: Mistral FVs rewrite intermediate representations; Llama/Gemma FVs produce near-zero changes despite successful steering. Activation patching confirms causal localization: easy tasks achieve perfect recovery at targeted layers; hard tasks show zero recovery everywhere.

Paper Structure

This paper contains 54 sections, 4 equations, 7 figures, 25 tables.

Figures (7)

  • Figure 1: IID steering accuracy across all 12 tasks for Llama-3.1-8B Base. The horizontal dashed line marks the IID threshold ($\tau = 0.10$). Morphological tasks dramatically exceed predictions. Character-level capitalize and first_letter succeed (falsifying negative-control predictions), while only reverse_word and sentiment_flip consistently fail.
  • Figure 2: Cosine similarity vs. OOD transfer accuracy for all cross-template pairs, colored by task. The pooled correlation is near zero (Llama Base: $r = 0.013$, Gemma Base: $r = 0.058$), dissolving the previously reported Simpson's paradox ($r = -0.572$). Points at high cosine ($>0.80$) span the full range of transfer accuracy, confirming that geometric alignment does not predict functional transfer.
  • Figure 3: FV steering accuracy vs. logit lens top-10 accuracy at the best layer for all 12 tasks. The gap between bars is the steerability-decodability gap. For all tasks across all models, FV steering accuracy meets or exceeds logit lens accuracy. The universal direction of this gap---steering $>$ readability---is the paper's core finding. Most dramatic gaps: country_capital (Llama Base: $0.880$ steering vs. $0.056$ readability, gap $= -0.82$), first_letter (Llama IT: $0.960$ vs. $0.047$, gap $= -0.91$).
  • Figure 4: Layer-wise FV steering accuracy vs. logit lens accuracy for representative tasks. For all tasks, FV steering peaks at earlier layers (L2--L8) than logit lens readability (L28--L32). For country_capital and first_letter, the logit lens is near zero at all layers while FV steering succeeds, demonstrating pure steerability-without-decodability. The temporal dissociation---early FV intervention, late readability emergence---supports the "FV as computational instruction" interpretation.
  • Figure 5: FV vocabulary projection results for Llama-3.1-8B Base. The near-universal incoherence of FV vocabulary projections, even for highly effective FVs ($>0.90$ steering accuracy), demonstrates that FVs encode computational instructions rather than answer directions. The partial exception is first_letter (task-relevant fraction $\sim 0.82$): the FV projects to single-character tokens, though not to the correct specific letters.
  • ...and 2 more figures
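The decodability test behind Figures 3-5 amounts to projecting a residual-stream vector (or the FV itself) through the unembedding matrix and inspecting the top-ranked tokens. The sketch below uses a toy random unembedding matrix; the function name and all dimensions are illustrative assumptions, not the paper's code.

```python
import numpy as np

def vocab_projection(vec, W_U, k=10):
    """Logit-lens-style readout (illustrative sketch).

    Projects a residual-stream vector through an unembedding matrix
    W_U of shape (d_model, vocab_size) and returns the indices of the
    k highest-logit tokens. Decodability is then judged by whether
    the correct answer token appears among these top-k tokens.
    """
    logits = vec @ W_U
    return np.argsort(logits)[::-1][:k]

# Toy unembedding matrix and function vector (sizes are arbitrary).
rng = np.random.default_rng(1)
d_model, vocab_size = 16, 100
W_U = rng.normal(size=(d_model, vocab_size))
fv = rng.normal(size=d_model)

top10 = vocab_projection(fv, W_U, k=10)
```

The paper's finding is that this readout stays incoherent for most FVs even when those same vectors steer with over 0.90 accuracy: the top-k tokens are not the task's answers, which is what motivates the "computational instruction" interpretation.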

Theorems & Definitions (1)

  • Definition 1: IID gating