Investigating Regularization of Self-Play Language Models
Reda Alami, Abdalgader Abubaker, Mastane Achab, Mohamed El Amine Seddik, Salem Lahlou
TL;DR
This work tackles instability in self-play fine-tuning (SPIN) for language model alignment by introducing two regularization strategies: an explicit KL penalty toward a base policy and a fictitious-play-inspired history smoothing that governs the opponent across iterations. The proposed α-SPIN framework uses a geometric mixture as the reference policy and a history length $h$ to blend past iterates, enabling controlled updates and potentially better convergence toward the data distribution. Empirical results on MT-Bench and the Hugging Face Open LLM Leaderboard show that KL regularization, history-length mixing, and fictitious play can yield improvements in several domains, with α in $(0,1)$ allowing interpolation between vanilla SPIN and DPO-like behavior. The findings highlight the practical value of regularized self-play and suggest promising avenues, including alternative sampler designs and IPO-based variants, for robust, low-cost alignment of large language models.
Abstract
This paper explores the effects of various forms of regularization in the context of language model alignment via self-play. While both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) require collecting costly human-annotated pairwise preferences, the self-play fine-tuning (SPIN) approach replaces the rejected answers with data generated by the previous iterate. However, SPIN suffers from performance instability during the learning phase, which can be mitigated by playing against a mixture of the two previous iterates. In the same vein, we propose to address this issue from two perspectives: first, by adding a Kullback-Leibler (KL) regularization term that keeps the policy close to the reference policy; second, by using the idea of fictitious play, which smooths the opponent policy across all previous iterations. In particular, we show that the KL-based regularizer boils down to replacing the previous policy with its geometric mixture with the base policy inside the SPIN loss function. Finally, we discuss empirical results on MT-Bench as well as on the Hugging Face Open LLM Leaderboard.
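
As a rough illustration of the geometric-mixture idea, the sketch below plugs a mixed reference into a SPIN-style logistic loss on sequence-level log-probabilities. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the scale `beta`, and the convention that `alpha` weights the previous iterate (so `alpha = 1` recovers the previous iterate as reference and `alpha = 0` the base policy) are all hypothetical choices made here for illustration. The mixture is left unnormalized, since its log-normalizer depends only on the prompt and cancels in the difference between the real and synthetic terms.

```python
import math


def geometric_mixture_logprob(logp_prev, logp_base, alpha):
    """Unnormalized log-probability under a geometric mixture.

    Assumed convention (not taken from the paper): the reference is
    proportional to pi_prev**alpha * pi_base**(1 - alpha).
    """
    return alpha * logp_prev + (1.0 - alpha) * logp_base


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def regularized_spin_loss(logp_theta_real, logp_theta_synth,
                          logp_prev_real, logp_prev_synth,
                          logp_base_real, logp_base_synth,
                          alpha=0.5, beta=0.1):
    """SPIN-style logistic loss with a geometric-mixture reference.

    All inputs are sequence-level log-probabilities for one prompt:
    *_real for the ground-truth answer, *_synth for the generated one.
    """
    ref_real = geometric_mixture_logprob(logp_prev_real, logp_base_real, alpha)
    ref_synth = geometric_mixture_logprob(logp_prev_synth, logp_base_synth, alpha)
    # Margin between the log-ratios on the real and the synthetic answer.
    margin = beta * ((logp_theta_real - ref_real)
                     - (logp_theta_synth - ref_synth))
    # Logistic loss: log(1 + exp(-margin)).
    return softplus(-margin)
```

With these assumptions, sweeping `alpha` between 0 and 1 moves the reference inside the loss from the base policy to the previous iterate, which is one way to read the interpolation between DPO-like behavior and vanilla SPIN described in the TL;DR.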
