REFA: Reference Free Alignment for multi-preference optimization
Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
TL;DR
REFA addresses the URSLA shortcut, a length-based reward-hacking phenomenon in reference-free, multi-preference alignment, by introducing EOS-probability regularizers that act directly on when generation terminates. It combines a robust REFA-Dynamic contrastive loss with targeted EOS regularization and two budgeted variants to enable explicit control over output length and token budgets. The approach yields state-of-the-art results on open benchmarks (e.g., a 60.29% win rate and a 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct) and offers a practical toolkit for aligning models under inference-time budgets without relying on a fixed reference policy. By connecting alignment quality with deployment efficiency, REFA provides a principled, tunable framework for robust, budget-aware generation in real-world applications.
Abstract
To mitigate reward hacking from response verbosity, modern preference optimization methods are increasingly adopting length normalization (e.g., SimPO, ORPO, LN-DPO). While effective against this bias, we demonstrate that length normalization itself introduces a failure mode: the URSLA shortcut. In this failure mode, models learn to satisfy the alignment objective by prematurely truncating low-quality responses rather than learning from their semantic content. To address this, we introduce REFA, a new alignment framework that exerts probabilistic control over the structural token governing termination. Our core innovation is a new class of regularizers that operate directly on the probability of the End-of-Sequence (EOS) token, a previously unexploited control lever. This token-level intervention provides a principled solution to the URSLA shortcut, ensuring genuine quality improvements. Furthermore, it unlocks a versatile mechanism for managing the alignment-efficiency tradeoff, enabling practitioners to fine-tune models that adhere to specific token budgets. Empirically, REFA achieves a 60.29% win rate and a 52.17% length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct, demonstrating the power of our token-level control paradigm.
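
The interaction between a length-normalized objective and an EOS-probability term can be illustrated with a toy sketch. This is not the REFA loss: the function names, the form and direction of the penalty (penalizing EOS probability at non-final positions of a dispreferred response), and the `weight` parameter are all assumptions made for illustration. The sketch only shows that a length-normalized margin cannot see how likely the model is to stop early, whereas a term on the EOS probability can.

```python
def length_normalized_logprob(token_logprobs):
    """Average per-token log-probability of a response (length normalization)."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def contrastive_margin(chosen_logprobs, rejected_logprobs):
    """Length-normalized score gap between a preferred and a dispreferred response."""
    return (length_normalized_logprob(chosen_logprobs)
            - length_normalized_logprob(rejected_logprobs))


def eos_penalty(eos_probs, weight=1.0):
    """Hypothetical regularizer (an assumption, not the paper's definition):
    weighted mean EOS probability over the non-final positions of a response."""
    if not eos_probs:
        return 0.0
    return weight * sum(eos_probs) / len(eos_probs)


def toy_objective(chosen_logprobs, rejected_logprobs, rejected_eos_probs, weight=1.0):
    """Toy loss to minimize: negative margin plus the hypothetical EOS penalty."""
    margin = contrastive_margin(chosen_logprobs, rejected_logprobs)
    return -margin + eos_penalty(rejected_eos_probs, weight)


# Made-up per-token log-probabilities for one preferred and one dispreferred response.
chosen = [-0.5, -0.5, -0.5]
rejected = [-2.0, -2.0]

# Two hypothetical policies with identical margins but different tendencies
# to emit EOS early inside the dispreferred response.
eos_rarely_stops = [0.01, 0.01]
eos_often_stops = [0.4, 0.6]

margin = contrastive_margin(chosen, rejected)
loss_rarely = toy_objective(chosen, rejected, eos_rarely_stops)
loss_often = toy_objective(chosen, rejected, eos_often_stops)
```

In this toy setting both policies obtain the same margin, so the contrastive term alone cannot separate them; only the EOS term distinguishes the policy that tends to terminate early. The actual regularizers and budgeted variants are defined in the paper itself.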
