Fusion Steering: Prompt-Specific Activation Control

Waldemar Chang; Alhassan Yasin

TL;DR

Fusion Steering targets factual accuracy in QA by injecting prompt-specific activation deltas across all transformer layers, with the deltas derived from enriched reference completions. The method combines additive steering with interpolated patching, optimizes the fusion weight $\alpha$ and steering strength $\gamma$ per prompt, and compares full-layer and segmented configurations. On 260 challenging SimpleQA prompts, segmented steering delivers the largest gains: 25.4% accuracy under the composite score (versus 3.5% for the baseline and 16.2% for full-layer steering) and 13.1% CORRECT under the SimpleQA rubric (versus 0.0% for the baseline). These results suggest that per-prompt, layer-aware activation interventions can improve factual grounding without fine-tuning, and they point to neuron-level interpretability as a route to scalable, interpretable control.
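
To make the roles of $\alpha$ and $\gamma$ concrete, the following is a minimal sketch of one plausible way to fuse interpolated patching with additive steering on a single hidden-state vector. The exact update rule is not given in this summary, so the formula, the function names `fuse` and `params_for_layer`, and the equal-width split of layers into segments are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence, Tuple


def fuse(h: Sequence[float], h_ref: Sequence[float],
         alpha: float, gamma: float) -> List[float]:
    """Assumed fusion rule (hypothetical, not taken from the paper).

    patched = (1 - alpha) * h + alpha * h_ref   # interpolated patching
    fused   = patched + gamma * (h_ref - h)     # additive steering with the delta
    """
    if len(h) != len(h_ref):
        raise ValueError("h and h_ref must have the same length")
    fused = []
    for x, r in zip(h, h_ref):
        delta = r - x                          # prompt-specific activation delta
        patched = (1.0 - alpha) * x + alpha * r
        fused.append(patched + gamma * delta)
    return fused


def params_for_layer(layer: int, n_layers: int,
                     segment_params: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the (alpha, gamma) pair for a layer.

    Assumption: layers are split into len(segment_params) contiguous,
    roughly equal segments. A single-entry list corresponds to the
    full-layer configuration, where every layer shares one pair.
    """
    n_seg = len(segment_params)
    idx = min(layer * n_seg // n_layers, n_seg - 1)
    return segment_params[idx]
```

Under these assumptions, setting `alpha=0` and `gamma=0` leaves the activation unchanged, while passing several `(alpha, gamma)` pairs to `params_for_layer` gives each group of layers its own intervention strength, which is the intuition behind the segmented configuration.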

Abstract

We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
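
The joint objective described above, which balances token overlap against perplexity, can be sketched as a single scalar score. The abstract does not specify how the two terms are weighted or how tokens are matched, so the whitespace tokenization, the log-perplexity penalty, and the `ppl_weight` parameter below are assumptions made for illustration only.

```python
import math
from typing import Sequence


def token_overlap(candidate: str, reference: str) -> float:
    """Fraction of distinct reference tokens that also appear in the candidate.

    Assumption: lowercase whitespace tokenization stands in for whatever
    tokenizer the actual pipeline uses.
    """
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    cand_tokens = set(candidate.lower().split())
    return len(ref_tokens & cand_tokens) / len(ref_tokens)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def joint_score(candidate: str, reference: str,
                token_logprobs: Sequence[float],
                ppl_weight: float = 0.1) -> float:
    """Higher is better: reward overlap, penalize log-perplexity.

    ppl_weight is a hypothetical trade-off coefficient, not a value
    reported in the paper.
    """
    overlap = token_overlap(candidate, reference)
    return overlap - ppl_weight * math.log(perplexity(token_logprobs))
```

In a per-prompt search, a score of this kind could serve as the return value of an Optuna objective: each trial would draw candidate values with `trial.suggest_float`, generate a steered completion, and return the score to a study created with `optuna.create_study(direction="maximize")`. The search ranges and number of trials are not stated here and would need to come from the full paper.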
