Energy-Driven Steering: Reducing False Refusals in Large Language Models

Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li

TL;DR

Energy-Driven Steering (EDS) tackles the false-refusal problem in LLM safety alignment by introducing an external Energy-Based Model (EBM) that defines an energy landscape over internal activations. EBMs are trained per layer with an InfoNCE objective so that desirable trajectories receive low energy and undesirable ones receive high energy; during inference, real-time gradient-based steering moves the hidden states toward low-energy regions, guiding the model toward helpful outputs without weight updates. The approach substantially reduces false refusals while preserving performance on safety benchmarks and general capabilities, outperforms both fine-tuning-free and fine-tuning methods on key metrics, and remains robust in multi-turn settings with low inference overhead. This provides a practical, scalable path to safer yet more helpful LLMs without costly retraining or static policy constraints.
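
To make the InfoNCE training signal concrete, the following is a minimal sketch of a generic InfoNCE-style loss over scalar energies, not the paper's exact formulation (which is not reproduced on this page). It assumes logits equal to negative energies; the function name `info_nce_loss`, the `temperature` parameter, and the example energy values are hypothetical.

```python
import math

def info_nce_loss(pos_energy, neg_energies, temperature=1.0):
    """Generic InfoNCE-style loss (assumed form): logits are negative energies.

    pos_energy: energy of one hidden state from a desirable trajectory.
    neg_energies: energies of hidden states from undesirable trajectories.
    """
    logits = [-pos_energy / temperature]
    logits += [-e / temperature for e in neg_energies]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_norm - logits[0]

# Hypothetical energies: the loss is small when the desirable state has
# lower energy than the undesirable ones, and large when the order flips.
well_separated = info_nce_loss(0.5, [3.0, 4.0])
inverted = info_nce_loss(4.0, [0.5, 3.0])
print(well_separated < inverted)
```

Under this assumed form, minimizing the loss pushes the energy of desirable states down relative to undesirable ones, which is the property the TL;DR attributes to the trained EBM.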

Abstract

Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus only on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse benign prompts. A key objective of safety alignment is therefore to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable ones (helpful response or safe rejection). During inference, the EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states toward low-energy regions, correcting the model in real time so that it generates a desirable response without any modification of its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show that our method achieves this objective: it substantially lowers false refusal rates, for example raising compliance on the ORB-H benchmark from 57.3% to 82.6%, while maintaining baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
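
The steering step described above can be sketched as plain gradient descent on the energy of a hidden state. This is only an illustration under strong assumptions: a toy quadratic energy with an analytic gradient stands in for the learned EBM, and the names `energy`, `energy_grad`, `steer`, `low_energy_center`, and all numeric values are hypothetical. The parameters `eta` and `steps` mirror the steering coefficient and the number of gradient descent steps per token that the paper ablates.

```python
def energy(h, center):
    # Toy stand-in for a learned EBM: E(h) = 0.5 * ||h - center||^2.
    return 0.5 * sum((x - c) ** 2 for x, c in zip(h, center))

def energy_grad(h, center):
    # Analytic gradient of the toy energy with respect to h.
    return [x - c for x, c in zip(h, center)]

def steer(h, center, eta=0.1, steps=3):
    # Move the hidden state downhill on the energy landscape:
    # h <- h - eta * grad E(h), repeated `steps` times.
    # The model weights are never touched; only the activation changes.
    for _ in range(steps):
        g = energy_grad(h, center)
        h = [x - eta * gi for x, gi in zip(h, g)]
    return h

hidden = [2.0, -1.0, 0.5]             # hypothetical hidden state
low_energy_center = [0.0, 0.0, 0.0]   # hypothetical low-energy region
steered = steer(hidden, low_energy_center)
print(energy(steered, low_energy_center) < energy(hidden, low_energy_center))
```

With a learned, non-linear EBM the gradient would presumably come from automatic differentiation rather than a closed form, but the update rule has the same shape.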

Paper Structure

This paper contains 41 sections, 3 theorems, 39 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Lemma C.1

Minimizing the InfoNCE loss (Equation eq:infonce_appendix_formal) trains the energy function $E_\theta(h)$ to assign lower energy values to hidden states from desirable trajectories ($\mathcal{D}_{\text{good}}$) and higher energy values to hidden states from undesirable trajectories ($\mathcal{D}_{\

Figures (5)

  • Figure 1: Comparison of existing LLM alignment strategies. (1) Fine-tuning methods (e.g., SFT, RLHF) modify parameters but suffer from high compute costs, long training times, and poor generalization. (2) Fine-tuning-free methods (e.g., prompt-driven, output filtering, activation steering) avoid retraining yet lack precision and effective steering capability. Compared with both families, Energy-Driven Steering offers the combined advantages of deployment flexibility, precise discrimination, and effective steering.
  • Figure 2: Overview of the Energy-Driven Steering framework. The method involves (1) gathering 'good' and 'bad' hidden state activations from a base LLM, (2) training an Energy-Based Model (EBM) to create an energy landscape that separates them, and (3) using this EBM to perform real-time, gradient-based steering to guide the model away from refusal-prone states during inference.
  • Figure 3: Robustness analysis on multi-turn jailbreak benchmarks. (a) Attack Success Rate (ASR) on the X-Teaming benchmark, evaluating the transferability of different methods against multi-turn attacks. Lower ASR is better. (b) Safety performance on the SafeDialBench benchmark, measuring the models' ability to identify unsafe content in multi-turn dialogues. The score is based on GPT-4's judgment, where a higher score indicates better identification capability.
  • Figure 4: Ablation studies on key hyperparameters for EBM steering with the Llama-3.1-8B-IT model. The plots show how ORB-H CR (%), JBB CR (%), and MMLU Acc (%) vary with: (a) the number of layers selected for intervention; (b) the steering coefficient ($\eta$); (c) the number of gradient descent steps per token.
  • Figure 5: Qualitative comparison of decision boundaries for classifying LLM hidden states. t-SNE visualizations show harmful (red) and harmless (blue) hidden state activations from Qwen3-14B. (Left) Vector Ablation yields a simple linear boundary that poorly separates the clusters. (Right) Our Energy-Based Model (EBM) learns a complex, non-linear boundary (where the energy gradient vanishes), accurately contouring and separating the clusters. This highlights the EBM’s superior discriminative power over linear methods. Boundaries are algorithmically generated by each method.

Theorems & Definitions (10)

  • Definition C.1: Energy Function
  • Definition C.2: Optimal Energy Function
  • Lemma C.1: Energy Landscape Property
  • proof
  • Definition C.3: State Probability Density
  • Definition C.4: Energy Gradient
  • Theorem C.1: Energy Minimization via Gradient-Based Steering
  • proof
  • Corollary C.1: Steering towards Compliance by Mitigating False Refusals
  • proof: Proof of Corollary