$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization

Long Tan Le; Han Shu; Tung-Anh Nguyen; Choong Seon Hong; Nguyen H. Tran

TL;DR

$i$REPO is a novel LLM alignment framework that uses implicit Reward pairwise difference regression for Empirical Preference Optimization. Its algorithm is backed by theoretical guarantees: it achieves optimal results under ideal assumptions and admits a practical performance-gap result without them.

Abstract

While astonishingly capable, large language models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent the dissemination of untruthful, toxic, or biased information. Traditional alignment methods based on reinforcement learning often struggle with instability, whereas preference optimization methods are limited by overfitting to pre-collected hard-label datasets. In this paper, we propose a novel LLM alignment framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization. Specifically, $i$REPO employs self-generated datasets labeled by empirical human (or AI annotator) preference to iteratively refine the aligned policy through a novel regression-based loss function. Furthermore, we introduce an innovative algorithm backed by theoretical guarantees: it achieves optimal results under ideal assumptions and provides a practical performance-gap result without such assumptions. Experimental results with Phi-2 and Mistral-7B demonstrate that $i$REPO effectively achieves self-alignment using soft labels, self-generated responses, and the logit of empirical AI annotator preferences. Our approach also surpasses preference optimization baselines in evaluations using the Language Model Evaluation Harness and Multi-turn benchmarks.
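To make the idea of regressing an implicit reward pairwise difference onto the logit of an empirical preference concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a DPO-style implicit reward, `beta * (log pi(y|x) - log pi_ref(y|x))`, a squared-error regression loss, and an empirical preference estimated as the fraction of annotator votes for the first response. All function names, dictionary keys, and the `beta` value are hypothetical and chosen only for illustration.

```python
import math


def empirical_preference(votes):
    """Fraction of annotators that preferred response 1 (votes are 1 or 0)."""
    return sum(votes) / len(votes)


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the value finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def implicit_reward(logp_policy, logp_ref, beta):
    """Assumed DPO-style implicit reward: beta * log(pi / pi_ref)."""
    return beta * (logp_policy - logp_ref)


def pairwise_regression_loss(batch, beta=0.1):
    """Mean squared error between the implicit reward difference of each
    response pair and the logit of its empirical preference (assumed form)."""
    total = 0.0
    for ex in batch:
        r1 = implicit_reward(ex["logp_policy_1"], ex["logp_ref_1"], beta)
        r2 = implicit_reward(ex["logp_policy_2"], ex["logp_ref_2"], beta)
        target = logit(empirical_preference(ex["votes"]))
        total += (r1 - r2 - target) ** 2
    return total / len(batch)


# Hypothetical pair: 3 of 4 annotators prefer response 1, and the policy
# still matches the reference model, so the implicit reward difference is 0.
example = {
    "logp_policy_1": -12.0, "logp_ref_1": -12.0,
    "logp_policy_2": -15.0, "logp_ref_2": -15.0,
    "votes": [1, 1, 1, 0],
}
loss = pairwise_regression_loss([example])
```

Under these assumptions the soft label enters only through its logit: a split vote (preference 0.5) gives a target of zero, while a lopsided vote pushes the implicit reward gap further apart, in contrast to a hard 0/1 label.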

Paper Structure (29 sections, 2 theorems, 34 equations, 4 figures, 11 tables, 1 algorithm)

Introduction
Related work
Preliminaries
RLHF with Explicit Reward Models
RLHF with Implicit Reward Models
$i$mplicit Reward pairwise based Empirical Preference Optimization ($i\textsc{REPO}$)
Empirical Human Preference Model
$i\textsc{REPO}$: Algorithm
$i\textsc{REPO}$: Theoretical Results
Experiments
Experimental Setting
Main Results
Ablation Studies
Conclusion
Proof of \ref{lem:perfect}
...and 14 more sections

Key Result

Lemma 4.4

Under Assumptions asmp:Realizability, asmp:population, and asmp:distribution_gap, let $\theta^{(\tau^{\star})}$ denote a solution to (equation missing). Then $\pi_{\theta^{(\tau^{\star})}}$ is a policy that generates responses aligned with the population human preference ${\mathcal{P}}^*$ in expectation of a total variation distance as follows: (equation missing). Furthermore, $\pi_{\theta^{(\tau^{\star})}}$ is also an optimal policy of t…

Figures (4)

Figure 1: MT-Bench single-grading evaluation for Phi-2 and Mistral-7B models with different methods.
Figure 2: (a) Performance of $i\textsc{REPO}$ with and without the logit of empirical human preference, and (b) performance of $i\textsc{REPO}$ with different numbers of AI annotators.
Figure 3: Preference classification accuracy of LLM rankers on the Ultrafeedback-Binarized dataset.

Theorems & Definitions (5)

Lemma 4.4
Theorem 4.5
Remark 4.6
proof
proof
