Supervised Fine-Tuning as Inverse Reinforcement Learning
Hao Sun
TL;DR
This work reframes LLM alignment, moving from a heavy reliance on preference datasets to a demonstration-based, RL-inspired approach. It treats auto-regressive generation as a sequential decision process and analyzes how different divergences shape learning. It connects supervised fine-tuning to forward KL trajectory matching, which explains the mass-covering bias of SFT, and it outlines how reverse KL and Jensen-Shannon divergences can produce mode-seeking behavior through adversarial imitation. It also highlights the potential of imitation learning with offline feedback and adversarial training in low-data or open-ended scenarios, and it critiques the Bradley-Terry-based reward modeling that underpins preference-based methods such as DPO. The practical implication is a principled guide for choosing alignment objectives and data formats that suit data availability, privacy constraints, and task openness, so that LLMs can be aligned with demonstrations rather than with preference datasets alone.
Abstract
The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. In addition, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.
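
The mass-covering versus mode-seeking distinction can be illustrated with a toy example that is not taken from the paper. The sketch below assumes a bimodal target distribution on a 1-D grid (standing in for demonstration data with two distinct valid answers) and a deliberately under-powered model family of single discretized Gaussians. It picks the best model by grid search under forward KL, KL(p || q), which is the maximum-likelihood direction that SFT corresponds to, and under reverse KL, KL(q || p). All names (`gaussian`, `kl`, `best_fit`) and all numeric settings are illustrative assumptions.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def gaussian(xs, mu, sigma):
    """Discretized Gaussian over the grid xs, normalized on that grid."""
    return normalize([math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs])

def kl(a, b, eps=1e-12):
    """KL(a || b) on a shared grid; eps guards against log(0)."""
    return sum(ai * (math.log(ai + eps) - math.log(bi + eps))
               for ai, bi in zip(a, b))

# Grid and a bimodal target with modes at -3 and +3 (toy assumption).
xs = [-8.0 + 0.1 * i for i in range(161)]
left = gaussian(xs, -3.0, 0.7)
right = gaussian(xs, 3.0, 0.7)
p = normalize([l + r for l, r in zip(left, right)])

# Candidate single-Gaussian models, searched exhaustively.
mus = [-5.0 + 0.25 * i for i in range(41)]
sigmas = [0.3 + 0.2 * i for i in range(25)]

def best_fit(objective):
    """Return the (mu, sigma) pair whose model minimizes the objective."""
    candidates = [(mu, sigma) for mu in mus for sigma in sigmas]
    return min(candidates, key=lambda c: objective(gaussian(xs, c[0], c[1])))

# Forward KL: expectation under the data, so missing a mode is costly.
mu_fwd, sigma_fwd = best_fit(lambda q: kl(p, q))
# Reverse KL: expectation under the model, so mass off the data is costly.
mu_rev, sigma_rev = best_fit(lambda q: kl(q, p))

print("forward KL fit:", mu_fwd, sigma_fwd)
print("reverse KL fit:", mu_rev, sigma_rev)
```

In this toy setting the forward KL fit settles between the two modes with a large variance (mass-covering), while the reverse KL fit commits to a single mode with a small variance (mode-seeking). Which of the two symmetric modes the reverse KL fit selects depends only on search order.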
