REFA: Reference Free Alignment for multi-preference optimization

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

TL;DR

REFA addresses the URSLA shortcut, a length-based reward-hacking phenomenon in reference-free, multi-preference alignment, by introducing EOS-probability regularizers that intervene directly in termination. It combines a robust REFA-Dynamic contrastive loss with targeted EOS regularization and two budgeted variants to give explicit control over output length and token budgets. The approach yields state-of-the-art results on open benchmarks (e.g., a 60.29% win rate and a 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct) and offers a practical toolkit for aligning models under inference-time budgets without relying on a fixed reference policy. By connecting alignment quality with deployment efficiency, REFA provides a principled, tunable framework for robust, budget-aware generation in real-world applications.

Abstract

To mitigate reward hacking from response verbosity, modern preference optimization methods are increasingly adopting length normalization (e.g., SimPO, ORPO, LN-DPO). While length normalization is effective against this bias, we demonstrate that it introduces a failure mode of its own: the URSLA shortcut. Here, models learn to satisfy the alignment objective by prematurely truncating low-quality responses rather than learning from their semantic content. To address this, we introduce REFA, a new alignment framework that places probabilistic control on the structural token that governs termination. Our core innovation is a new class of regularizers that operate directly on the probability of the End-of-Sequence (EOS) token, a previously unexploited control lever. This token-level intervention provides a principled solution to the URSLA shortcut, ensuring genuine quality improvements. Furthermore, it unlocks a versatile mechanism for managing the alignment-efficiency tradeoff, enabling practitioners to fine-tune models that adhere to specific token budgets. Empirically, REFA achieves a 60.29% win rate and a 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct, demonstrating the power of our token-level control paradigm.
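To make the shortcut concrete, the following is a minimal sketch on toy numbers, not the paper's implementation. It assumes a response is scored by its average per-token log-probability, and it uses a hypothetical penalty, `eos_penalty`, defined as a coefficient `lam` times the mean EOS probability at positions where stopping would be premature. The function names, the toy log-probabilities, and the exact form of the penalty are illustrative assumptions; the regularizers actually used by REFA are defined in the full paper.

```python
def length_normalized_logprob(token_logprobs):
    """Return the average per-token log-probability of one response."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def eos_penalty(eos_probs, lam=0.1):
    """Hypothetical regularizer: lam times the mean EOS probability.

    eos_probs holds the model's EOS probability at each position where
    stopping would be premature (an assumption made for illustration).
    """
    if not eos_probs:
        return 0.0
    return lam * sum(eos_probs) / len(eos_probs)


# Toy rejected response: early tokens are uncertain, later ones confident,
# mirroring the trend that the URSLA conjecture describes.
rejected_logprobs = [-3.0, -2.0, -1.0, -0.5, -0.5]

full_score = length_normalized_logprob(rejected_logprobs)
truncated_score = length_normalized_logprob(rejected_logprobs[:2])

# Truncation lowers the rejected response's length-normalized score, so a
# contrastive objective can be satisfied without any semantic learning.
print(full_score, truncated_score)

# Penalizing EOS mass at premature positions makes this shortcut costly.
print(eos_penalty([0.2, 0.4], lam=0.5))
```

In this toy setting the truncated response scores lower than the full one purely because of its length, which is the behavior an EOS-level penalty is meant to discourage.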

Paper Structure

This paper contains 116 sections, 6 theorems, 96 equations, 17 figures, 6 tables, 2 algorithms.

Key Result

Lemma 1

Let $\ell_i(\theta)$ be defined as: where and $p_i^{\text{target}}$ is a fixed scalar satisfying $p_i^{\text{target}} > 0$. Assume $\pi_\theta(y_j|x) > 0$ for all $j=1,\dots,K$. Then, the partial derivatives of $\ell_i(\theta)$ with respect to $\pi_\theta(y_i|x)$ and $\pi_\theta(y_j|x)$ for $j \neq i$ are: and for each $j \neq i$:

Figures (17)

  • Figure 1: Empirical analysis supporting the URSLA conjecture (Conjecture 1) on Ultrafeedback. The plot shows a decreasing trend in average per-token negative log-probability (uncertainty) as sequence length increases for both Mistral-7B and Llama-3-8B base models.
  • Figure 2: Impact of $\lambda$ on Mistral-Base (7B) Performance: a) Average Response Length, b) Length-Controlled Win Rate (LC-WR), and c) Win Rate (WR).
  • Figure 3: Analysis of token distribution in REFA: (a) across alignment methods (InfoNCA, REFA, and SimPO); (b) across the regularization coefficient $\lambda$ for the REFA method.
  • Figure 4: Performance (LC-WR, WR) and Average Tokens vs. $\lambda$ for the budget-independent EOS regularizer on AlpacaEval2. Note the peak LC-WR at $\lambda \approx 0.1$.
  • Figure 5: Performance (LC-WR, WR) and Average Tokens vs. $\lambda$ for the budgeted EOS regularizer with a target budget of $b=256$ on AlpacaEval2.
  • ...and 12 more figures

Theorems & Definitions (13)

  • Conjecture 1: Uncertainty Reduction with Sequence Length Assertion (URSLA)
  • Lemma 1: Gradient of a Single-Term Objective
  • proof
  • Lemma 2: Directional Influence of a Single-Term Objective
  • proof
  • Lemma 3: Gradient of the Simplified REFA-Dynamic Loss
  • proof
  • Lemma 4: Increasing Probability of Positive Responses Decreases the Loss
  • proof
  • Lemma 5: Length-Induced Probability Inflation
  • ...and 3 more