Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Minjae Kang, Jaehyung Kim

TL;DR

DIRECTER (Dynamic rejection steering) is a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring an extra dataset. It shows potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

Abstract

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring an extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control further shows potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.
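The decoding loop described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact plausibility test, so the rule below (accept the steered candidate if its probability under the raw distribution is at least $\beta$ times the raw maximum) is an assumption chosen for illustration, as are the names `is_plausible`, `decode_step`, and `steered_probs_for`, the greedy candidate selection, and the toy distributions.

```python
from typing import Callable, Sequence, Tuple


def argmax(probs: Sequence[float]) -> int:
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def is_plausible(token: int, raw_probs: Sequence[float], beta: float) -> bool:
    # ASSUMPTION: the paper's actual criterion is not given in the abstract.
    # Here a token is plausible if the unsteered model gives it at least
    # beta times the probability of its own top token.
    return raw_probs[token] >= beta * max(raw_probs)


def decode_step(
    raw_probs: Sequence[float],
    steered_probs_for: Callable[[int], Sequence[float]],
    layer_counts: Sequence[int],
    beta: float,
) -> Tuple[int, int]:
    """Pick one token, weakening steering until the candidate is plausible.

    layer_counts lists how many layers are steered at each attempt,
    strongest first. Returns (token, number_of_steered_layers_used).
    """
    for n_layers in layer_counts:
        candidate = argmax(steered_probs_for(n_layers))
        if is_plausible(candidate, raw_probs, beta):
            return candidate, n_layers  # steering success: accept
        # steering failure: reject and retry with fewer steered layers
    return argmax(raw_probs), 0  # every steered candidate was rejected


# Toy data (hypothetical): a 3-token vocabulary.
raw = [0.6, 0.3, 0.1]
steered = {4: [0.1, 0.1, 0.8], 2: [0.2, 0.7, 0.1], 1: [0.5, 0.4, 0.1]}
token, layers_used = decode_step(raw, steered.__getitem__, [4, 2, 1], beta=0.4)
print(token, layers_used)  # prints "1 2"
```

With four steered layers the candidate (token 2) has raw probability 0.1, below the threshold 0.4 × 0.6 = 0.24, so it is rejected. With two steered layers the candidate (token 1) has raw probability 0.3 and is accepted.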

Paper Structure

This paper contains 69 sections, 9 equations, 17 figures, 27 tables, and 2 algorithms.

Figures (17)

  • Figure 1: An overview of Directer's plausibility-guided decoding loop. At each step, a steered output distribution ($\tilde{p}_t$) from KV cache scaling is compared against the raw output distribution ($p_t$). (a) Steering Failure: If the steered candidate is deemed implausible, it is rejected, triggering a progressive reduction of steered layers to weaken the intervention. (b) Steering Success: If the candidate is plausible, it is accepted for decoding.
  • Figure 2: Ablation studies for plausibility-guided decoding. (a) Performance of Directer compared to variants using a fixed steering strength (ST$k$, where steering is applied to the $2^{k-1}$ top-ranked layers). (b) Applying our plausibility filter to other steering methods mitigates oversteering and improves performance. For each method, the solid bar shows the result of the original version, while the hatched bar shows the result with our filter.
  • Figure 3: Robustness analysis of Directer. The gray dashed line indicates the baseline. The black dotted line denotes accuracy (%), and the red diamonds denote change rate (%). (a) Performance across different plausibility thresholds ($\beta$). (b) Stability across a wide range of scaling factors ($\alpha$). (c) Robustness to different prompts, showing consistent improvement on the average of four variants.
  • Figure 4: Inference efficiency analysis. Directer maintains (a) competitive throughput and (b) per-token decoding speed while (c) adding negligible memory overhead.
  • Figure 5: The prompt template used for rewriting IFEval samples.
  • ...and 12 more figures
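
The ST$k$ notation in the Figure 2 caption (steering applied to the $2^{k-1}$ top-ranked layers) can be made concrete with a short sketch. The layer ranking below is hypothetical, and the halving schedule in `weakening_schedule` is only an assumption about how "progressive reduction of steered layers" might step through these levels. The function names are illustrative, not taken from the paper.

```python
from typing import List, Sequence


def layers_for_level(ranked_layers: Sequence[int], k: int) -> List[int]:
    """Layers steered at level ST k: the 2**(k-1) top-ranked layers."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return list(ranked_layers[: 2 ** (k - 1)])


def weakening_schedule(ranked_layers: Sequence[int], max_k: int) -> List[List[int]]:
    # ASSUMPTION: each rejection drops one level, halving the steered layers.
    return [layers_for_level(ranked_layers, k) for k in range(max_k, 0, -1)]


# Hypothetical ranking of layer indices, most influential first.
ranking = [12, 7, 20, 3, 15, 9, 1, 30]
print(layers_for_level(ranking, 3))    # prints [12, 7, 20, 3]
print(weakening_schedule(ranking, 3))  # prints [[12, 7, 20, 3], [12, 7], [12]]
```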