Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher
TL;DR
This work reframes multi-step RLHF as a two-player constant-sum Markov game in order to capture intermediate and non-transitive human preferences, moving beyond the Bradley-Terry assumption and terminal-only signals. It introduces Optimistic Multi-step Preference Optimization (OMPO), which applies optimistic online mirror descent to the occupancy-measure formulation of the game. OMPO converges to an $ε$-approximate Nash equilibrium in $\mathcal{O}(ε^{-1})$ policy updates and admits a projection-free, practical implementation. The theory is complemented by experiments on a multi-turn conversation benchmark (MT-bench-101) and on math reasoning datasets, where OMPO and its MPO variant outperform several baselines and adapt to both intermediate and terminal rewards. The approach thus supports per-turn preference learning from general, non-transitive signals with convergence guarantees, which is relevant to alignment in multi-turn dialogue and multi-step reasoning tasks.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods such as DPO have demonstrated strong performance, they frame interaction with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize its winning rate against the other across all steps of the conversation. Our approach, Optimistic Multi-step Preference Optimization (OMPO), is built upon the optimistic online mirror descent algorithm~\citep{rakhlin2013online,joulani17a}. Theoretically, we provide a rigorous convergence analysis and show that OMPO requires $\mathcal{O}(ε^{-1})$ policy updates to converge to an $ε$-approximate Nash equilibrium. We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.
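
To make the game-theoretic view concrete, the following is a minimal sketch of optimistic mirror descent with an entropy regularizer in self-play. It is a toy single-step matrix game, not the occupancy-measure algorithm of OMPO. The preference matrix `P`, the step size `eta`, the iteration count, and all function names are illustrative assumptions. `P` is deliberately non-transitive (response 0 beats 1, 1 beats 2, 2 beats 0), so no Bradley-Terry scoring can represent it, yet the constant-sum game still has a Nash equilibrium that the optimistic update can approach.

```python
import numpy as np

# Hypothetical preference matrix over 3 responses (an assumption for illustration):
# P[i, j] = probability that response i is preferred over response j.
# Constant-sum structure: P[i, j] + P[j, i] = 1.
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])
assert np.allclose(P + P.T, 1.0)


def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def nash_gap(pi):
    """Best-response win rate against pi, minus the self-play value 1/2."""
    return float((P @ pi).max() - 0.5)


def optimistic_self_play(eta=0.3, steps=20000):
    """Optimistic mirror descent (entropy regularizer) against a copy of itself."""
    logits = np.zeros(P.shape[0])
    pi = softmax(logits)
    g_prev = P @ pi  # win rate of each response against the current policy
    for _ in range(steps):
        g = P @ pi
        # Optimistic step: the current gradient plus the correction (g - g_prev).
        logits = logits + eta * (2.0 * g - g_prev)
        logits = logits - logits.max()  # re-center; softmax is shift-invariant
        g_prev = g
        pi = softmax(logits)
    return pi


pi = optimistic_self_play()
print("policy:", pi, "gap:", nash_gap(pi))
```

For this particular `P`, solving $(P - \tfrac{1}{2})\pi = 0$ on the simplex gives the mixed equilibrium $(0.3, 0.3, 0.4)$, which the iterates should approach. The quantity returned by `nash_gap` plays the role of $ε$ in the abstract's notion of an $ε$-approximate Nash equilibrium, restricted to this toy setting.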
