Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning

Jiahao Zhang, Lujing Zhang, Keltin Grimes, Zhuohao Yu, Gokul Swamy, Zhiwei Steven Wu

TL;DR

This work derives a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences and applies it to the problem of fine-tuning large language models from multi-objective LLM-as-a-Judge feedback.

Abstract

A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, releasing trained model checkpoints at the 7B and 3B parameter scales.
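To make source $\textit{(ii)}$ concrete, here is a minimal sketch of how scalarizing several individually transitive objectives can produce a cyclic preference with no Condorcet winner. The responses, objective names, rankings, and the majority-vote scalarization are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from itertools import permutations

# Hypothetical toy data: three responses ranked along three objectives.
# Each ranking is transitive on its own (best response first).
rankings = {
    "helpfulness": ["A", "B", "C"],
    "conciseness": ["B", "C", "A"],
    "safety": ["C", "A", "B"],
}
responses = ["A", "B", "C"]


def majority_prefers(x, y):
    """True if x beats y on a strict majority of objectives (one simple scalarization)."""
    wins = sum(r.index(x) < r.index(y) for r in rankings.values())
    return wins > len(rankings) / 2


def condorcet_winner():
    """Return the response that beats every other response, or None if there is none."""
    for x in responses:
        if all(majority_prefers(x, y) for y in responses if y != x):
            return x
    return None


def find_cycle():
    """Return a triple (x, y, z) with x > y > z > x, or None if no such cycle exists."""
    for x, y, z in permutations(responses, 3):
        if majority_prefers(x, y) and majority_prefers(y, z) and majority_prefers(z, x):
            return (x, y, z)
    return None


print("Condorcet winner:", condorcet_winner())
print("Preference cycle:", find_cycle())
```

In this toy instance every objective is internally consistent, yet the aggregated preference is cyclic (A beats B, B beats C, C beats A), so no single response can be declared optimal under the scalarized preference.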

Paper Structure (24 sections, 9 theorems, 56 equations, 5 figures, 7 tables)

This paper contains 24 sections, 9 theorems, 56 equations, 5 figures, 7 tables.

Key Result

Theorem 4.2

Under the regression assumption, after $T$ iterations, there exists some $\hat{\pi} \in \{\pi_{\theta_1},\ldots,\pi_{\theta_T}\}$ such that $V(\pi^\star)-V(\hat{\pi})\le O\left(\sqrt{1/T}+\sqrt{C_{\pi_{\mathsf{ref}}\rightarrow\pi^\star}\,\epsilon}\right)$, where $\pi^\star\in\mathop{\mathrm{arg\,max}}$ ...
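As a rough reading of the bound in Theorem 4.2 (a sketch only; the target accuracy $\delta > 0$ is notation introduced here, not the paper's), the first term shrinks with the number of iterations while the second does not depend on $T$. Requiring the first term to be at most $\delta$ gives

$$\sqrt{1/T} \le \delta \iff T \ge \frac{1}{\delta^{2}},$$

so for any $T \ge 1/\delta^{2}$ the guarantee becomes

$$V(\pi^\star)-V(\hat{\pi}) \le O\left(\delta+\sqrt{C_{\pi_{\mathsf{ref}}\rightarrow\pi^\star}\,\epsilon}\right).$$

For example, $\delta = 0.1$ corresponds to $T \ge 100$ iterations, up to the constants hidden in the $O(\cdot)$. The remaining term $\sqrt{C_{\pi_{\mathsf{ref}}\rightarrow\pi^\star}\,\epsilon}$ cannot be reduced by running longer; presumably it reflects the regression error $\epsilon$ scaled by how well $\pi_{\mathsf{ref}}$ covers $\pi^\star$, though the precise definitions are given in the paper's assumptions rather than in this summary.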

Figures (5)

  • Figure 1: We study the problem of learning from multiple preferences over different objectives or criteria, each of which might be intransitive (i.e., inconsistent). Such preferences are common when learning from large language model (LLM) "judges" (Zheng et al., 2023) that evaluate a response along multiple objectives. We propose a novel solution concept for multi-objective preference fine-tuning (PFT), the Maximum Entropy Blackwell Winner (MaxEntBW), that remains well-defined under multiple intransitive preferences. We then derive PROSPER, a regression-based algorithm for computing MaxEntBWs, and use it to fine-tune LLMs from multi-objective LLM judge feedback on multiple problems.
  • Figure 2: On a held-out set of prompts from WildChecklists, we report the fraction of prompts with no Condorcet Winner (left) and Intransitive Preferences (right) for both the $\mathcal{P}_{\mathsf{JC}}$ (joint check) and $\mathcal{P}_{\mathsf{SC}}$ (single check) judges, where $N$ denotes the number of generated responses. We see that splitting up a rubric into multiple items before passing it to the LLM judge (i.e., using $\mathcal{P}_{\mathsf{SC}}$ rather than $\mathcal{P}_{\mathsf{JC}}$) reduces but doesn't eliminate inconsistent preferences.
  • Figure 3: We consistently see policies trained via PROSPER outperform RLCF (roughly $2/3$ of the time), baseline (roughly $3/4$ of the time), and ablation method policies, as measured by LLM judge win-rates on held-out prompts. This indicates that PROSPER is able to more effectively optimize nuanced, multi-criteria LLM judge preferences than the other methods we consider.
  • Figure 4: Prompt for generating a preference score according to a specific criterion.
  • Figure 5: Prompt for generating a preference score according to multiple criteria.

Theorems & Definitions (16)

  • Definition 3.1: Maximum Entropy Blackwell Winner (MaxEntBW)
  • Theorem 4.2
  • Lemma 2.1
  • proof
  • Theorem 2.2: Bauer's maximum principle, concave form
  • Lemma 2.3
  • Lemma 2.4
  • proof
  • Lemma 2.5
  • proof
  • ...and 6 more