Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Chenliang Li; Siliang Zeng; Zeyi Liao; Jiaxiang Li; Dongyeop Kang; Alfredo Garcia; Mingyi Hong

TL;DR

This work tackles the alignment problem by proposing Alignment with Integrated Human Feedback (AIHF), a single-stage framework that jointly learns rewards and policies from both demonstrations and preferences. By formulating AIHF as a bi-level optimization, the authors unify supervised demonstration data with human preferences, providing a finite-time convergence guarantee and showing that the resulting reward and policy are consistent with all data sources. Special cases of AIHF recover RLHF, DPO, and self-play variants, while experiments on LLM alignment and MuJoCo robotics demonstrate substantial performance gains, especially when preference data are limited or demonstrations are abundant but imperfect. The approach reduces distribution mismatch and data under-utilization inherent in multi-stage pipelines, offering a principled, data-efficient path toward better-aligned AI systems with practical impact for both language models and embodied agents.

Abstract

Aligning models with human preferences and values is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning from human feedback (RLHF) break the task into successive stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each performing one specific learning task. Such a sequential approach results in serious issues, such as significant under-utilization of data and distribution mismatch between the learned reward model and the generated policy, which eventually lead to poor alignment performance. We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF), capable of integrating both human preference and demonstration data to train the reward model and the policy. The proposed approach admits a suite of efficient algorithms, which easily reduce to, and leverage, popular alignment algorithms such as RLHF and Direct Preference Optimization (DPO), and require only minor changes to existing alignment pipelines. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo. We observe that the proposed solutions outperform existing alignment algorithms such as RLHF and DPO by large margins, especially when the amount of high-quality preference data is relatively limited.
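To make the idea of using both data sources at once more concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a single-state problem with three actions, a tabular reward, a softmax policy induced by that reward, and a Bradley-Terry model for preferences. The names `joint_loss`, `joint_grad`, `w_pref`, and `beta`, as well as the toy data, are hypothetical and chosen only for illustration; AIHF itself is formulated as a bi-level problem over full trajectories.

```python
import math

def softmax_policy(reward, beta=1.0):
    """Policy induced by a tabular reward: pi(a) proportional to exp(r(a) / beta)."""
    m = max(reward)
    w = [math.exp((r - m) / beta) for r in reward]
    z = sum(w)
    return [x / z for x in w]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint_loss(reward, demos, prefs, w_pref=1.0, beta=1.0):
    """Demonstration negative log-likelihood plus weighted Bradley-Terry loss."""
    pi = softmax_policy(reward, beta)
    demo_nll = -sum(math.log(pi[a]) for a in demos) / len(demos)
    pref_nll = -sum(math.log(sigmoid(reward[w] - reward[l])) for w, l in prefs) / len(prefs)
    return demo_nll + w_pref * pref_nll

def joint_grad(reward, demos, prefs, w_pref=1.0, beta=1.0):
    """Analytic gradient of joint_loss with respect to the tabular reward."""
    pi = softmax_policy(reward, beta)
    grad = [0.0] * len(reward)
    for a_demo in demos:
        for a in range(len(reward)):
            indicator = 1.0 if a == a_demo else 0.0
            grad[a] -= (indicator - pi[a]) / (beta * len(demos))
    for win, lose in prefs:
        slack = 1.0 - sigmoid(reward[win] - reward[lose])
        grad[win] -= w_pref * slack / len(prefs)
        grad[lose] += w_pref * slack / len(prefs)
    return grad

# Hypothetical toy data: demonstrated actions and (preferred, rejected) pairs.
demos = [2, 2, 1, 2]
prefs = [(2, 0), (1, 0), (2, 1)]

reward = [0.0, 0.0, 0.0]
loss_before = joint_loss(reward, demos, prefs)
for _ in range(200):  # plain gradient descent on the shared reward
    g = joint_grad(reward, demos, prefs)
    reward = [r - 0.5 * gi for r, gi in zip(reward, g)]
loss_after = joint_loss(reward, demos, prefs)
policy = softmax_policy(reward)
```

In this sketch both data sources shape the same reward, so the induced policy reflects the demonstrations and the preference pairs together rather than being fit in separate stages.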

Paper Structure (35 sections, 5 theorems, 114 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries and Related Work
Notation
The RLHF Pipeline
Reward Learning using Demonstration Data
Joint Learning from Demonstration and Preference
Other Approaches to Alignment
Alignment with Integrated Human Feedback (AIHF)
A Meta-Formulation
Specification of AIHF
Special Cases of AIHF
Why AIHF can outperform two-stage alignment approaches?
Proposed Algorithm for AIHF Training
Experiments
Conclusion
...and 20 more sections

Key Result

Lemma 4.1

Suppose that $L_1$ takes the form of the objective def:ml for reward learning from demonstrations, and suppose that $L_3$ takes the form eq:L3 with $c(\cdot)$ being the KL-divergence w.r.t. some initial policy $\pi^0$. Then we have the following expression: where $\pi_{\theta}$ is the optimal policy given the reward model parameterized by $\theta$, with the expression eq:opt:policy.
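As a hedged illustration of the kind of object the lemma refers to, the sketch below works in a simplified single-state setting, which is an assumption and not the paper's setting. There, maximizing expected reward minus a KL penalty to an initial policy $\pi^0$ has the well-known closed form $\pi(a) \propto \pi^0(a)\exp(r(a)/\beta)$, and the gradient of a demonstration's log-likelihood with respect to a tabular reward is the demonstrated action's indicator minus the policy's probability, scaled by $1/\beta$. The function names and the temperature `beta` are hypothetical; this is not the lemma's own expression.

```python
import math

def kl_regularized_policy(rewards, prior, beta=1.0):
    """Single-state KL-regularized optimum: pi(a) proportional to prior(a) * exp(r(a) / beta)."""
    logits = [math.log(p) + r / beta for r, p in zip(rewards, prior)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def demo_loglik_grad(rewards, prior, demo_action, beta=1.0):
    """Gradient of log pi(demo_action) with respect to each tabular reward entry."""
    pi = kl_regularized_policy(rewards, prior, beta)
    return [((1.0 if a == demo_action else 0.0) - pi[a]) / beta
            for a in range(len(rewards))]

# Hypothetical toy values: three actions and a uniform initial policy.
rewards = [0.0, 1.0, 2.0]
prior = [1.0 / 3, 1.0 / 3, 1.0 / 3]
pi = kl_regularized_policy(rewards, prior, beta=1.0)
grad = demo_loglik_grad(rewards, prior, demo_action=2, beta=1.0)
```

The gradient entries sum to zero: raising the reward of the demonstrated action is balanced by lowering the rewards of actions the current policy favors.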

Figures (9)

Figure 1: Comparison of the RLHF (left) with the proposed AIHF (right).
Figure 2: Experiment results of Pythia-160M/1B/2.8B policy models, with the reward model trained from Pythia-1.4B. We record the average scores (across three trials) of AIHF and RLHF on the Anthropic-HH test dataset (see Table tab:policy_quality in the Appendix for more comparisons with other algorithms).
Figure 3: Experiment results on Pythia-1B policy models, where the reward model is trained from Pythia-1.4B models. We record the average scores of AIHF and RLHF on the Anthropic-HH test dataset, reporting the results across three different trials.
Figure 4: Performance comparison between Direct AIHF and Self-Play AIHF training across the six benchmark datasets (see also Table tab:leaderboard in the Appendix).
Figure 5: Top-left: Hopper environment; top-right: HalfCheetah environment; bottom: Walker2d environment. AIHF (orange) vs. RLHF (blue) vs. IPL (purple) [hejna2024inverse]; results are averaged over 3 independent runs. We use 10k demonstrations and 20k preferences. The RLHF and IPL curves are initialized from a policy pre-trained by BC; the AIHF curve starts from a random policy. Performance is plotted against the number of SAC steps performed (for AIHF, each policy alignment performs 5k steps of SAC).
...and 4 more figures

Theorems & Definitions (5)

Lemma 4.1
Theorem 4.1
Lemma A.1
Lemma A.2
Lemma A.3
