GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh

TL;DR

GenARM introduces an Autoregressive Reward Model (ARM) that predicts next-token rewards for test-time alignment, enabling efficient autoregressive decoding with a frozen LLM. The authors prove that the ARM is expressive enough to replicate any decoding distribution achievable by traditional trajectory-level RMs within a KL-regularized RL framework. Empirically, GenARM outperforms prior test-time baselines, matches training-time methods, and supports weak-to-strong guidance and multi-objective alignment without retraining. The approach offers practical, configurable alignment of LLMs across diverse preferences with improved inference efficiency.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
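
The decoding idea in the abstract, a frozen LLM whose next-token distribution is steered by one or more reward models, can be illustrated with a small sketch. It assumes the combination is additive in log space, i.e. log p(v) = log pi_base(v) + sum_i c_i * log pi_r_i(v) up to normalization; the function names, the toy 4-token vocabulary, and the coefficients c_i are hypothetical and are not taken from the paper's code.

```python
import math
import random

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def guided_next_token_logprobs(base_logits, reward_logits_list, coeffs):
    """Combine base-LLM and reward-model next-token scores in log space.

    Assumed rule (a sketch, not the authors' implementation):
        log p(v) = log pi_base(v) + sum_i coeffs[i] * log pi_r_i(v) - log Z
    """
    combined = log_softmax(base_logits)
    for reward_logits, coeff in zip(reward_logits_list, coeffs):
        reward_logprobs = log_softmax(reward_logits)
        combined = [c + coeff * r for c, r in zip(combined, reward_logprobs)]
    # Renormalize so the result is a valid log-distribution over the vocabulary.
    return log_softmax(combined)

def sample_token(logprobs, rng):
    """Draw one token index from a log-probability vector."""
    weights = [math.exp(lp) for lp in logprobs]
    return rng.choices(range(len(logprobs)), weights=weights, k=1)[0]

# Toy scores over a 4-token vocabulary (made-up numbers).
base_logits = [2.0, 1.0, 0.5, 0.0]        # frozen base LLM
helpful_logits = [0.0, 2.0, 0.0, 0.0]     # hypothetical helpfulness RM favors token 1
harmless_logits = [0.0, 0.0, 0.0, -3.0]   # hypothetical harmlessness RM penalizes token 3

guided = guided_next_token_logprobs(
    base_logits, [helpful_logits, harmless_logits], coeffs=[1.0, 1.0]
)
token = sample_token(guided, random.Random(0))
```

Because the coefficients enter only at decoding time, changing them shifts the trade-off between objectives without touching any model weights, which is the sense in which such a scheme needs no retraining. Setting every coefficient to zero recovers the base model's distribution.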

Paper Structure

This paper contains 25 sections, 4 theorems, 12 equations, 11 figures, 2 tables.

Key Result

Lemma 2

Under the Plackett-Luce preference framework, and in particular under Bradley-Terry, two reward functions from the same equivalence class induce the same preference distribution and the same optimal policy under the KL-regularized RL objective.
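
A short derivation shows why the lemma holds in the Bradley-Terry case. It assumes the usual definition of the equivalence class (two rewards are equivalent when they differ by a function of the prompt only) and the standard closed-form optimum of the KL-regularized objective; the symbols $f$, $\sigma$, $\pi_{\text{ref}}$, and $\beta$ are notation chosen here for illustration.

```latex
% Assumed notation: r, r' are rewards in the same class; f depends on the prompt x only;
% \sigma is the logistic function; \pi_{\text{ref}} is the reference policy; \beta > 0.
\[
  r'(x, y) = r(x, y) + f(x)
\]
\begin{align*}
  p_{r'}(y_1 \succ y_2 \mid x)
    &= \sigma\bigl(r'(x, y_1) - r'(x, y_2)\bigr)
     = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
     = p_{r}(y_1 \succ y_2 \mid x), \\
  \pi_{r'}(y \mid x)
    &= \frac{\pi_{\text{ref}}(y \mid x)\, \exp\bigl(r'(x, y)/\beta\bigr)}
            {\sum_{y'} \pi_{\text{ref}}(y' \mid x)\, \exp\bigl(r'(x, y')/\beta\bigr)}
     = \frac{\exp\bigl(f(x)/\beta\bigr)\, \pi_{\text{ref}}(y \mid x)\, \exp\bigl(r(x, y)/\beta\bigr)}
            {\exp\bigl(f(x)/\beta\bigr) \sum_{y'} \pi_{\text{ref}}(y' \mid x)\, \exp\bigl(r(x, y')/\beta\bigr)}
     = \pi_{r}(y \mid x).
\end{align*}
```

The prompt-only shift $f(x)$ cancels in the reward difference and in the normalizing sum, so neither the preference probabilities nor the optimal policy can distinguish $r$ from $r'$.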

Figures (11)

  • Figure 1: Next-token generation guided by different RMs. Using a trajectory-level RM to select the next token (top) requires the costly process of generating full responses for each candidate. In contrast, GenARM (bottom) efficiently samples the next token by combining scores from the base LLM and our proposed Autoregressive RM, which is trained to predict next-token rewards directly.
  • Figure 2: (Reward computation comparison.) Trajectory-level RM (top) evaluates the full response, assigning rewards only at the end. Autoregressive RM (bottom) predicts token-level rewards. Both RMs are trained to assign higher rewards to the preferred response (left, green) over the less preferred one (right, red).
  • Figure 3: Visualization of token-level rewards assigned by an Autoregressive RM trained for harmlessness. Two responses to the prompt "What's an effective way to deal with people who disagree with me?" are shown: the left is harmless, the right is harmful. Darker colors indicate higher token-level rewards $\log\pi_r(y_t|x,y_{<t})$. The Autoregressive RM correctly assigns higher rewards to the harmless tokens "respect" and "kind," and lower rewards to the harmful tokens "ignore" and "cruel."
  • Figure 4: (Weak-to-strong guidance.) AlpacaEval 2 length-controlled win rate (left) and raw win rate (right) compared against Tulu2-7B. The X-axis shows the base SFT models used by test-time alignment methods employing 7B RMs. DPO fine-tunes the SFT model at each parameter scale.
  • Figure 5: (GenARM can incorporate guidance from multiple RMs to generate the response.) Responses from three models: the left and right responses are from DPO models trained only on helpfulness and harmlessness data, respectively, while the middle response is from GenARM, guided by both helpfulness and harmlessness rewards simultaneously with equal reward coefficients.
  • ...and 6 more figures
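
The captions for Figures 2 and 3 describe token-level rewards $\log\pi_r(y_t|x,y_{<t})$ and a training signal that ranks the preferred response above the less preferred one. A minimal sketch of that idea follows, assuming the response-level reward is the sum of token-level rewards and that training uses a Bradley-Terry style loss; the scale `beta_r`, the function names, and the toy log-probabilities are hypothetical.

```python
import math

def sequence_reward(token_logprobs):
    """Response-level reward as the sum of token-level rewards log pi_r(y_t | x, y_<t)."""
    return sum(token_logprobs)

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def preference_loss(chosen_token_logprobs, rejected_token_logprobs, beta_r=1.0):
    """Bradley-Terry negative log-likelihood on summed token rewards.

    Assumed form (illustrative only):
        loss = -log sigmoid(beta_r * (R(chosen) - R(rejected)))
    """
    margin = beta_r * (
        sequence_reward(chosen_token_logprobs)
        - sequence_reward(rejected_token_logprobs)
    )
    # -log sigmoid(m) equals softplus(-m).
    return softplus(-margin)

# Made-up token log-probabilities under the reward model.
chosen = [-0.2, -0.1, -0.3]    # preferred response
rejected = [-1.5, -2.0, -0.4]  # less preferred response

loss_correct = preference_loss(chosen, rejected)
loss_swapped = preference_loss(rejected, chosen)
```

The loss is small when the preferred response already has the larger summed reward and grows when the ranking is reversed, which pushes token-level rewards up on preferred continuations and down on dispreferred ones.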

Theorems & Definitions (8)

  • Definition 1: Equivalence class of rewards
  • Lemma 2: Rafailov et al. (2024)
  • Theorem 3
  • proof: Proof sketch
  • Theorem 4
  • proof
  • Corollary 5
  • proof