Inference-time Alignment in Continuous Space

Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng

TL;DR

Inference-time alignment has largely relied on discrete search over candidate responses guided by a reward model, which underperforms when the base policy is weak or the candidate set is small. SEA reframes this as continuous optimization by defining an energy function $E(\boldsymbol{x}, \boldsymbol{y}) = \log \pi_{\mathrm{ref}}(\boldsymbol{y} \mid \boldsymbol{x}) + \alpha r(\boldsymbol{x}, \boldsymbol{y})$ and performing Langevin dynamics in the continuous logit space to steer the base policy toward high-reward regions. By initializing from the reference model and using continuous logits with differentiable gradient updates, SEA achieves more effective exploration than discrete BoN-style search, delivering substantial improvements on AdvBench, TruthfulQA, GSM8K, and MATH across multiple base models. The method demonstrates strong safety and reasoning gains, addresses shallow alignment vulnerabilities, and remains competitive in time/memory efficiency, highlighting the potential of continuous optimization for inference-time alignment. The code is publicly available, underscoring SEA’s practicality for real-world deployment and further research.
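The update described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a single token position with a four-token vocabulary, replaces the language model and reward model with fixed lists, and relaxes the energy to its expectation under `softmax(logits)`. The function names, the `noise_scale` parameter, and all numeric values are hypothetical. Because the TL;DR writes $E$ as $\log \pi_{\mathrm{ref}} + \alpha r$, the sketch moves the logits in the direction that increases this quantity; under the opposite sign convention this is descent on $-E$.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_scores(ref_logits, rewards, alpha):
    """Per-token score s_v = log pi_ref(v) + alpha * r(v) (toy stand-in for E)."""
    ref_probs = softmax(ref_logits)
    return [math.log(p) + alpha * r for p, r in zip(ref_probs, rewards)]


def expected_score(logits, scores):
    """Relaxed objective: expectation of the score under softmax(logits)."""
    return sum(p * s for p, s in zip(softmax(logits), scores))


def score_gradient(logits, scores):
    """Analytic gradient of expected_score w.r.t. the logits: p_k * (s_k - mean)."""
    probs = softmax(logits)
    mean = sum(p * s for p, s in zip(probs, scores))
    return [p * (s - mean) for p, s in zip(probs, scores)]


def langevin_adapt(ref_logits, rewards, alpha=10.0, step=0.1,
                   n_steps=500, noise_scale=0.05, seed=0):
    """Langevin-style updates in logit space, initialized from the reference logits.

    noise_scale is an assumption of this sketch: it shrinks the Gaussian term,
    since the toy objective is bounded while the logits are unbounded.
    noise_scale=0.0 reduces the loop to plain gradient ascent.
    """
    rng = random.Random(seed)
    scores = token_scores(ref_logits, rewards, alpha)
    y = list(ref_logits)  # start from the base policy, as the text describes
    for _ in range(n_steps):
        grad = score_gradient(y, scores)
        y = [yi + step * gi + noise_scale * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
             for yi, gi in zip(y, grad)]
    return y


if __name__ == "__main__":
    ref = [2.0, 1.0, 0.0, -1.0]      # base policy prefers token 0
    rewards = [0.0, 0.0, 0.0, 1.0]   # reward model prefers token 3
    adapted = langevin_adapt(ref, rewards)
    print("reference probs:", [round(p, 3) for p in softmax(ref)])
    print("adapted probs:  ", [round(p, 3) for p in softmax(adapted)])
```

In this toy setting the base policy gives the rewarded token a probability of roughly 3%, so sampling a handful of candidates would rarely surface it, whereas the gradient updates shift probability mass toward it directly.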

Abstract

Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation ($\textbf{SEA}$), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to $\textbf{77.51\%}$ on AdvBench and $\textbf{16.36\%}$ on MATH. Our code is publicly available at https://github.com/yuanyige/sea
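The limitation of discrete search noted in the abstract can be made concrete with a short calculation. This is a sketch under simplifying assumptions that are not taken from the paper: candidates are drawn independently, each is "good" with probability `p`, and the reward model always picks a good candidate when one is present. The function names are hypothetical.

```python
import math


def success_probability(p, n):
    """Chance that at least one of n independent candidates is good."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p, target):
    """Smallest n with success_probability(p, n) >= target, for 0 < p < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # A weaker base policy means a smaller p, and the required n grows roughly as 1/p.
    for p in (0.5, 0.1, 0.01):
        print(f"p={p}: n for 95% success = {samples_needed(p, 0.95)}")
```

Under these assumptions, a base policy that produces a good response 1% of the time needs 299 candidates for a 95% chance of containing one, which is the regime where a small candidate set gives search little to select from.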

Paper Structure

This paper contains 57 sections, 7 equations, 7 figures, 19 tables, 1 algorithm.

Figures (7)

  • Figure 1: Reward Model Landscape: purple (low reward) to yellow (high reward). Base Model Landscape: white (low probability) to blue (high probability). Search-Based Method: selects from base model candidates (blue points); the chosen one is often far from the optimal reward. Our method SEA: black arrows trace the optimization trajectory of the initial response along the reward gradient, reaching a final response near the optimal region.
  • Figure 2: (a) Best-of-N sampling is restricted in the rewards it can explore, by both the capability of the base model and the size $N$ of the candidate set. (b) The weaker the base model, the lower the probability of good responses, and the larger the $N$ (growing exponentially) that Best-of-N sampling needs to generate such a response. (c) SEA outperforms Best-of-N sampling with a large $N=64$ across all three tasks of safety, truthfulness, and reasoning, in both reward exploration and task-specific metrics.
  • Figure 3: Overview of SEA. SEA defines the RLHF optimal distribution as an energy function and applies Langevin dynamics in the continuous logit space $\{\mathbf{y}^{(n)}_{i}\}_{i=1}^{L}$. The procedure starts with a soft sequence as an initial sample from an initial energy-based distribution and iteratively adapts it through gradient-based optimization. The resulting sample is approximately a sample from the desired RLHF optimal distribution.
  • Figure 4: (a) Evolution of the KL divergence between optimized and initial responses, across all token positions over iterations. Iterations are denoted by colors ranging from black (start) to yellow (end). (b) Change in KL divergence over iterations at three positions: Position 1 (first), Position 33 (middle), and Position 49 (last). (c) The top-5 tokens with the largest probability increases and decreases across the entire vocabulary at three positions, with safe (unsafe) tokens colored in red (blue).
  • Figure 5: Time/Memory Efficiency and Effectiveness
  • ...and 2 more figures
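
The per-position KL divergence that Figure 4 tracks can be computed as in the following sketch. It assumes each response is represented as a list of per-position token distributions and that the direction is KL(optimized || initial); the caption does not state the direction, so that choice and the function names are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute zero; q is assumed to have full support,
    which holds for softmax outputs.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def per_position_kl(optimized, initial):
    """One KL value per token position, comparing optimized to initial distributions."""
    return [kl_divergence(p, q) for p, q in zip(optimized, initial)]


if __name__ == "__main__":
    initial = [[0.5, 0.5], [0.5, 0.5]]
    optimized = [[0.5, 0.5], [0.9, 0.1]]  # only the second position has moved
    print(per_position_kl(optimized, initial))
```

A position whose distribution is unchanged yields a KL of zero, so plotting these values per iteration shows which positions the optimization actually modifies.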