Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling

Yuxuan Tang, Yifan Feng

TL;DR

This paper tackles the limitation of pairwise preference signals in LLM alignment by introducing Ranked Choice Preference Optimization (RCPO), a maximum-likelihood framework that accommodates ranked feedback such as top-$k$ lists. RCPO unifies existing methods (e.g., DPO) under a general choice-model perspective and demonstrates how to instantiate it with utility-based (Multinomial Logit) and rank-based (Mallows-RMJ) models. Empirical results on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard show RCPO variants consistently outperform baselines, with Mallows-RMJ-PO-Top-2 often delivering the strongest gains and robustness to evaluation context. The work provides a practical, extensible foundation for incorporating richer ranked feedback into LLM alignment, potentially improving factuality, usefulness, and safety in deployment.

Abstract

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-$k$ rankings. We propose Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. The framework is flexible, supporting both utility-based and rank-based choice models. It subsumes several existing pairwise methods (e.g., DPO, SimPO), while providing principled training objectives for richer feedback formats. We instantiate this framework with two representative ranked choice models (Multinomial Logit and Mallows-RMJ). Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard benchmarks show that RCPO consistently outperforms competitive baselines. RCPO shows how directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers a versatile and extensible foundation for incorporating (ranked) choice modeling into LLM training.
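To make the maximum-likelihood idea concrete, here is a minimal sketch of a top-$k$ loss under a Multinomial Logit (Plackett-Luce) choice model. It assumes each response's utility is the DPO-style implicit reward $\beta \log(\pi_\theta / \pi_{\mathrm{ref}})$; that choice, the function names, and the toy log-probabilities are illustrative assumptions, not the paper's exact objective. It does not cover the Mallows-RMJ variant.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Assumed utility: beta * log(pi_theta(y|x) / pi_ref(y|x)), as in DPO."""
    return beta * (logp_policy - logp_ref)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mnl_top_k_nll(rewards, ranking):
    """Negative log-likelihood of a top-k ranking under MNL (Plackett-Luce).

    rewards: utility of each candidate response, indexed by position.
    ranking: indices of the k preferred responses, best first.
    Each ranked item is modeled as a softmax choice over the candidates
    that have not yet been picked.
    """
    remaining = list(range(len(rewards)))
    nll = 0.0
    for idx in ranking:
        log_prob = rewards[idx] - logsumexp([rewards[j] for j in remaining])
        nll -= log_prob
        remaining.remove(idx)  # a chosen item leaves the choice set
    return nll

# Toy example: four candidate responses with made-up sequence log-probabilities.
logp_policy = [-12.0, -15.0, -14.0, -20.0]
logp_ref = [-13.0, -14.5, -14.0, -18.0]
rewards = [implicit_reward(p, r) for p, r in zip(logp_policy, logp_ref)]

# Annotator ranks response 0 first and response 2 second (top-2 feedback).
top2_loss = mnl_top_k_nll(rewards, ranking=[0, 2])
```

With two candidates and $k = 1$, this loss reduces to $-\log \sigma(r_w - r_l)$, which is the familiar pairwise DPO form. This is one way to see how a choice-model objective can subsume pairwise methods.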

Paper Structure

This paper contains 47 sections, 5 theorems, 49 equations, 2 figures, and 11 tables.

Key Result

Theorem 1

Suppose the underlying single-best choice preference distribution follows the Multinomial Logit (MNL) model. Then the corresponding policy optimization objective is given by:

Figures (2)

  • Figure 1: Ranked Choice Preference Optimization (RCPO)
  • Figure :

Theorems & Definitions (5)

  • Theorem 1: MNL-PO-Discrete
  • Theorem 2: MNL-PO-Top-k
  • Theorem 3: Mallows-RMJ-PO-Discrete
  • Theorem 4: Mallows-RMJ-PO-Top-k
  • Theorem 5: Mallows-RMJ-PO-Pairwise