Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Jiaxiang Li; Siliang Zeng; Hoi-To Wai; Chenliang Li; Alfredo Garcia; Mingyi Hong

TL;DR

This work proposes to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model, and discovers a connection between the proposed IRL-based approach and a recent line of work called Self-Play Fine-tune (SPIN).

Abstract

Aligning models with human preferences and values is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned on human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. This reward model serves as a proxy for human preference, and it is critical for guiding the RL step towards improving model quality. In this work, we argue that the SFT stage also benefits significantly from learning a reward model. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model. This approach leads to new SFT algorithms that are not only efficient to implement but also robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL-based approach and a recent line of work called Self-Play Fine-tune (SPIN). Theoretically, we show that the proposed algorithms converge to the stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using the proposed methods and evaluate them on a reward benchmark model and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.

Paper Structure (16 sections, 7 theorems, 40 equations, 4 figures, 6 tables, 2 algorithms)

Introduction
Preliminaries
Reward Learning and Policy Fine Tuning from Demonstration Data
Joint Reward-learning and Policy Fine-tuning by Inverse RL
Implicit Reward-learning Fine-tuning via Self-generation
Convergence Theory
Discussions
Numerical experiments
Experiment Setup
Results of RFT (Algorithm \ref{algo:offline_ML_IRL})
Results of IRFT (Algorithm \ref{algo:self_gen_grad})
Conclusions and Limitations
Related works
Proofs for Section \ref{sec:main}
Implementation details of the numerical experiments
...and 1 more sections

Key Result

Lemma 3.1

Problem \eqref{eq:ME_IRL} is equivalent to the following minimax optimization problem:

Figures (4)

Figure 1: Left: Difference between SFT and the two proposed methods: RFT (Algorithm \ref{algo:offline_ML_IRL}) and IRFT (Algorithm \ref{algo:self_gen_grad}); Right: Log-probability gap between the chosen/preferred continuation and the rejected/non-preferred continuations for different methods. All methods consume only the chosen/preferred data, but RFT and IRFT can effectively distinguish between chosen and rejected continuations; see Example 2 in Sec. \ref{sec:main} for the detailed settings.
Figure 2: A stateless counter-example with three actions, where IRL-based fine-tuning \eqref{eq:ME_IRL} shows a regularization effect over SFT \eqref{eq:SFT}: it maintains weight on data that is unseen in the demonstration dataset $\mathcal{D}$. Here we assume $r\in[0, R]$.
Figure 3: Table summarizing the computational costs of the proposed methods.
Figure 4: Fine-tuning results of Algorithm \ref{algo:offline_ML_IRL} on pythia-1.4b over Anthropic-HH (with the top 10k data picked by PKU-Alignment/beaver-7b-v3.0-reward). We report the average score on the test dataset (left) and the win rate of Algorithm \ref{algo:offline_ML_IRL} over the (full SFT) base model and the SFT model.
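
The regularization effect described in the Figure 2 caption can be illustrated with a small numerical sketch. This is not the paper's construction: it assumes a tabular policy over three actions, a uniform reference policy, and the standard closed form for a KL-regularized policy, $\pi(a) \propto \pi_{\text{ref}}(a)\exp(r(a)/\beta)$. The counts, the reward values, $R$, and $\beta$ are hypothetical numbers chosen only for illustration. Under these assumptions, maximum-likelihood SFT assigns zero probability to an action that never appears in the demonstrations, while the KL-regularized policy keeps at least $\pi_{\text{ref}}(a)\,e^{-R/\beta}$ on it whenever $r\in[0, R]$.

```python
import math

def sft_mle(demo_counts):
    """Tabular maximum-likelihood (SFT) policy: empirical action frequencies."""
    total = sum(demo_counts)
    return [c / total for c in demo_counts]

def kl_regularized_policy(rewards, ref_policy, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref).

    The solution is pi(a) proportional to ref(a) * exp(r(a) / beta).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(ref_policy, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical numbers, chosen only for illustration (not from the paper).
demo_counts = [6, 4, 0]            # action 2 never appears in the demonstrations
ref_policy = [1 / 3, 1 / 3, 1 / 3]  # uniform reference policy (assumption)
R, beta = 1.0, 1.0                 # reward bound and KL weight (assumptions)
rewards = [R, R, 0.0]              # worst case for action 2 within [0, R]

pi_sft = sft_mle(demo_counts)
pi_reg = kl_regularized_policy(rewards, ref_policy, beta)

# Any reward in [0, R] leaves at least ref(a) * exp(-R / beta) on action a.
lower_bound = ref_policy[2] * math.exp(-R / beta)

print("SFT policy:           ", pi_sft)
print("KL-regularized policy:", pi_reg)
print("Lower bound on unseen action:", lower_bound)
```

In this toy setting the SFT policy collapses onto the two demonstrated actions, whereas the regularized policy retains non-zero mass on the unseen one, which is the qualitative behavior the caption attributes to the IRL-based formulation.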

Theorems & Definitions (7)

Lemma 3.1
Lemma 3.2
Theorem 3.1
Lemma B.1
Lemma B.2
Lemma B.3
Theorem B.1
