Multi-Reference Preference Optimization for Large Language Models

Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

TL;DR

MRPO introduces a principled, closed-form direct preference optimization framework that integrates multiple reference LLMs to guide fine-tuning toward human preferences while respecting a reference-informed constraint. By deriving a surrogate objective with a virtual aggregated reference $\tilde{\pi}_{ref}$ and a clipping mechanism (CTRO) around a primary reference, MRPO achieves stable training and improved generalization across scarce and large preference datasets. The Adaptive Reference Weighting Coefficients (ARWC) automatically balance the influence of each reference, and theoretical analyses show favorable gradient behavior compared to naive multi-DPO approaches. Empirically, MRPO yields consistent gains on preference-learning tasks and general language understanding benchmarks, including GSM8K and TruthfulQA, with additional benefits in distillation to smaller models. These results demonstrate MRPO’s practical potential for scalable, multi-reference alignment of LLMs with human values and preferences.

Abstract

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preferences on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a reference model. Recent approaches, such as direct preference optimization (DPO), have eliminated the need for unstable and sluggish reinforcement learning optimization by introducing closed-form supervised losses. However, a significant limitation of these approaches is that they are designed for a single reference model only, neglecting the collective power of numerous pretrained LLMs. To overcome this limitation, we introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models, substantially enhancing preference learning capabilities compared to single-reference DPO. Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to exhibit superior performance in several downstream natural language processing tasks such as GSM8K and TruthfulQA.

Paper Structure (26 sections, 2 theorems, 27 equations, 4 figures, 8 tables)

This paper contains 26 sections, 2 theorems, 27 equations, 4 figures, 8 tables.

Key Result

Proposition 1

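The following policy is the optimum for a lower bound of the RLHF objective (Eq. eq:prefm-rlhf-1): $\pi^{*}(y|x)=\frac{1}{Z\left(x\right)}\tilde{\pi}_{ref}(y|x)\exp\left(\frac{1}{\beta}r\left(x,y\right)\right)$, where $\tilde{\pi}_{ref}(y|x)=\left(\sum_{k=1}^{K}\frac{\alpha_{k}}{\pi_{ref}^{k}\left(y|x\right)}\right)^{-1}$ is the virtual aggregated reference and $Z\left(x\right)=\sum_{y}\tilde{\pi}_{ref}(y|x)\exp\left(\frac{1}{\beta}r\left(x,y\right)\right)$ is the normalizing partition function.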
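
To make the quantities in Proposition 1 concrete, here is a minimal Python sketch that evaluates them on a toy three-outcome problem. It assumes the optimal policy takes the normalized form implied by $Z(x)$, that is, $\tilde{\pi}_{ref}(y|x)\exp(r(x,y)/\beta)$ divided by $Z(x)$, and that the weights $\alpha_k$ are non-negative and sum to 1. The function names, reference distributions, and rewards are hypothetical illustrations, not values from the paper.

```python
import math


def aggregate_reference(ref_probs, alphas):
    """Weighted harmonic mean of K reference distributions.

    ref_probs: list of K lists, each holding pi_ref^k(y|x) for every outcome y.
    alphas:    list of K weights (assumed non-negative and summing to 1).
    """
    num_outcomes = len(ref_probs[0])
    return [
        1.0 / sum(a / probs[i] for a, probs in zip(alphas, ref_probs))
        for i in range(num_outcomes)
    ]


def optimal_policy(ref_probs, alphas, rewards, beta):
    """Normalize tilde_pi_ref(y|x) * exp(r(x, y) / beta) by the partition function Z(x)."""
    aggregated = aggregate_reference(ref_probs, alphas)
    unnormalized = [q * math.exp(r / beta) for q, r in zip(aggregated, rewards)]
    z = sum(unnormalized)  # Z(x)
    return [u / z for u in unnormalized]


# Toy example (hypothetical numbers): two references over three candidate outputs.
ref_a = [0.5, 0.3, 0.2]
ref_b = [0.2, 0.3, 0.5]
alphas = [0.5, 0.5]
rewards = [1.0, 0.0, -1.0]

aggregated = aggregate_reference([ref_a, ref_b], alphas)
policy = optimal_policy([ref_a, ref_b], alphas, rewards, beta=1.0)
```

Two properties are worth noting in this sketch. With a single reference and $\alpha_1=1$, the aggregated reference reduces to that reference, which recovers the single-reference setting. With several references, the weighted harmonic mean generally sums to less than 1 over $y$, so it is not itself a probability distribution; the division by $Z(x)$ is what makes the resulting policy sum to 1.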

Figures (4)

  • Figure 1: Chosen/Rejection preference accuracy on 3 small datasets: S1, S2 and S3. The curves show the mean and std. of preference accuracy on test sets over training batches for 5 runs. In the first row, for MRPO and Multi-DPO, RefM1 is Llama, and RefM2 is Mistral. In the second row, this order is reversed.
  • Figure 2: Chosen/Rejection preference accuracy on 3 big datasets: HelpSteer, Ultrafeedback and Nectar. The curves show the mean and std. of preference accuracy on test sets over training batches for 3 runs.
  • Figure 3: Reward Margin on 3 big datasets: HelpSteer, Ultrafeedback and Nectar. The curves show the mean and std. of reward margin on test sets over training batches for 3 runs.
  • Figure 4: Analysis on $\epsilon$ and $\alpha$ using S1, S2 and S3 datasets. The curves show the mean and std. of testing preference accuracy over training batches for 5 runs. In the first row, adaptive $\epsilon_{max}=0.1$ is compared with fixed $\epsilon=0.1$. In the second row, adaptive $\alpha$ is compared with different fixed values of $\alpha$.

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof