Fusion Steering: Prompt-Specific Activation Control

Waldemar Chang, Alhassan Yasin

TL;DR

Fusion Steering targets factual accuracy in QA by injecting prompt-specific activation deltas across the full network, guided by enriched reference activations. The method combines additive steering with interpolated patching, optimizes the fusion weight $\alpha$ and steering strength $\gamma$ separately for each prompt, and compares full-layer and segmented configurations. On 260 challenging SimpleQA prompts, segmented steering delivers the largest gains: 25.4% accuracy on the composite score, ahead of the baseline (3.5%) and full-layer steering (16.2%), and 13.1% CORRECT under the SimpleQA rubric, up from 0.0% for the baseline. The results indicate that per-prompt, layer-aware activation interventions can meaningfully improve factual grounding without fine-tuning, and they point to connections with neuron-level interpretability as a route to scalable, interpretable control.
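
To make the roles of $\alpha$ and $\gamma$ concrete, here is a minimal NumPy sketch of one way to combine interpolated patching with additive steering. The combination rule, the function names (`fuse_activation`, `steer_all_layers`), and the `segment_of` mapping are illustrative assumptions, not the authors' formulation; the paper's own equations are not reproduced on this page.

```python
import numpy as np

def fuse_activation(h, h_ref, delta, alpha, gamma):
    """Hypothetical fusion rule (an assumption, not the paper's equation).

    h      : activation of the prompt at one layer
    h_ref  : reference activation at the same layer
    delta  : prompt-specific activation delta for that layer
    alpha  : fusion weight, interpolates h toward h_ref (patching)
    gamma  : steering strength, scales the additive delta
    """
    patched = (1.0 - alpha) * h + alpha * h_ref  # interpolated patching
    return patched + gamma * delta               # additive steering

def steer_all_layers(hs, h_refs, deltas, params, segment_of):
    """Apply the fusion rule to every layer.

    params     : list of (alpha, gamma) pairs, one per segment
    segment_of : maps a layer index to an index into params
                 (always 0 for a full-layer configuration)
    """
    steered = []
    for i, (h, h_ref, delta) in enumerate(zip(hs, h_refs, deltas)):
        alpha, gamma = params[segment_of(i)]
        steered.append(fuse_activation(h, h_ref, delta, alpha, gamma))
    return steered

# Toy example with 6 layers of width 4 (sizes chosen only for illustration).
hs = [np.zeros(4) for _ in range(6)]
h_refs = [np.ones(4) for _ in range(6)]
deltas = [np.full(4, 2.0) for _ in range(6)]

# Full-layer: one (alpha, gamma) pair shared by all layers.
full = steer_all_layers(hs, h_refs, deltas, [(0.5, 0.25)], lambda i: 0)

# Segmented: three hypothetical segments of two layers each.
seg_params = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
seg = steer_all_layers(hs, h_refs, deltas, seg_params, lambda i: i // 2)
```

In this sketch the full-layer configuration shares a single $(\alpha, \gamma)$ pair across the network, while the segmented configuration gives each group of layers its own pair, which is the distinction the comparison above turns on.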

Abstract

We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
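
As a quick sanity check on the reported figures, the short Python sketch below shows how a thresholded accuracy of this kind can be computed and which prompt counts the percentages imply for a 260-prompt set. The counts are inferred from rounding and are not reported in the source; the function names are hypothetical.

```python
N_PROMPTS = 260  # evaluation set size stated in the abstract

# Percentages as reported in the abstract.
REPORTED = {
    "baseline": 3.5,
    "full_layer": 16.2,
    "segmented": 25.4,
    "segmented_simpleqa_correct": 13.1,
}

def accuracy(scores, threshold=0.6):
    """Percentage of outputs whose composite score meets the threshold."""
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)

def implied_count(pct, n=N_PROMPTS):
    """Nearest whole number of prompts consistent with a reported percentage."""
    return round(pct / 100.0 * n)

# Inferred counts (an assumption derived from rounding, not source data).
implied = {name: implied_count(pct) for name, pct in REPORTED.items()}

# Each inferred count rounds back to the reported percentage.
consistent = all(
    round(100.0 * implied[name] / N_PROMPTS, 1) == pct
    for name, pct in REPORTED.items()
)
```

Under this reading, 25.4% corresponds to 66 of 260 prompts for segmented steering, against 42 for full-layer steering and 9 for the baseline, and 13.1% CORRECT corresponds to 34 prompts.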

Paper Structure

This paper contains 20 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Topic distribution comparison across all subsets. Proportions remain consistent despite sample size differences, indicating strong stratification.
  • Figure 2: Evaluation across three accuracy metrics. (Top Left) Accuracy based on combined factual and quality score, (Top Right) Accuracy based on token-level overlap, (Bottom Left) Accuracy based on rubric-assigned LLM scores, and (Bottom Right) Full distribution of LLM grades (1–5) assigned to model outputs. Segmented steering consistently outperforms all baselines across evaluation dimensions.
  • Figure 3: Score distributions for combined metric and token overlap. Boxplots illustrate variation in steering effectiveness across prompts.
  • Figure 4: Percentage of completions labeled as CORRECT under the SimpleQA rubric grading. Bar heights represent the proportion of examples (out of 260) graded as fully correct by an LLM-based evaluator. The baseline yielded no correct responses, while full-layer and segmented steering show substantial gains.