On a Connection Between Imitation Learning and RLHF
Teng Xiao, Yige Yuan, Mingxiao Li, Zhengyu Chen, Vasant G Honavar
TL;DR
The paper shows that RLHF effectively performs imitation learning on the chosen-response distribution under a reverse KL objective, which unifies RLHF and imitation learning. It then introduces Direct Imitation Learning (DIL), a principled framework that directly optimizes a density-ratio reward without explicit reward modeling or RL loops, using density ratio estimation based on Bregman divergences. DIL subsumes existing alignment methods such as DPO and demonstrates strong empirical gains on Open LLM Leaderboard benchmarks and on human-preference tasks (Reddit TL;DR, Anthropic-HH). The results suggest that avoiding the Bradley-Terry (BT) preference-model assumption and directly optimizing the imitation objective better preserves reasoning abilities and aligns more closely with human preferences.
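As a point of reference for the claim that DIL subsumes DPO, the following is a minimal sketch of the standard per-example DPO loss, written with the Python standard library only. It is not the DIL objective itself, which this summary does not spell out. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (illustrative sketch)."""
    # Implicit reward of a response: beta * log(pi_theta / pi_ref).
    # The margin is the implicit reward of the chosen response minus the rejected one.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Loss is -log(sigmoid(margin)), computed stably as softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses, the margin is zero and the loss equals log 2; raising the policy's log-probability of the chosen response relative to the reference lowers the loss. The log-ratio inside the margin is what makes the density-ratio view natural: the implicit reward is itself a log density ratio between the policy and the reference.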
Abstract
This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback (RLHF) and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose Direct Imitation Learning (DIL), a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks.
