Investigating Regularization of Self-Play Language Models

Reda Alami; Abdalgader Abubaker; Mastane Achab; Mohamed El Amine Seddik; Salem Lahlou

TL;DR

This work tackles instability in self-play fine-tuning (SPIN) for language model alignment by introducing two regularization strategies: an explicit KL penalty toward a base policy and a fictitious-play-inspired history smoothing that governs the opponent across iterations. The proposed $\alpha$-SPIN framework uses a geometric mixture as the reference policy and a history length $h$ to blend past iterates, enabling controlled updates and potentially better convergence toward alignment with the data distribution. Empirical results on MT-Bench and the Hugging Face Open LLM Leaderboard show that KL regularization, history-length mixing, and fictitious play can yield improvements in several domains, with $\alpha \in (0,1)$ allowing interpolation between vanilla SPIN and DPO-like behavior. The findings highlight the practical value of regularized self-play and suggest promising avenues, including alternative sampler designs and IPO-based variants, for robust, low-cost alignment of large language models.

Abstract

This paper explores the effects of various forms of regularization in the context of language model alignment via self-play. While both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) require collecting costly human-annotated pairwise preferences, the self-play fine-tuning (SPIN) approach replaces the rejected answers with data generated by the previous iterate. However, SPIN suffers from performance instability during learning, which can be mitigated by playing against a mixture of the two previous iterates. In the same vein, we propose to address this issue from two perspectives: first, by incorporating an additional Kullback-Leibler (KL) regularization that keeps the policy close to the reference policy; second, by using the idea of fictitious play, which smooths the opponent policy across all previous iterations. In particular, we show that the KL-based regularizer boils down to replacing the previous policy with its geometric mixture with the base policy inside the SPIN loss function. We finally discuss empirical results on MT-Bench as well as on the Hugging Face Open LLM Leaderboard.
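To make the geometric-mixture idea concrete, here is a minimal sketch of a per-example loss, assuming a DPO-style logistic loss with a temperature `beta`. The function names, the argument layout, and the default `beta=0.1` are illustrative assumptions, not the paper's implementation. The inputs are sequence log-probabilities of the chosen answer $y_w$ and the generated answer $y_l$ under the current policy, the previous iterate, and the base policy. The normalizing constant of the geometric mixture depends only on the prompt $x$, so it cancels in the difference between the two log-ratios and is omitted.

```python
import math


def geometric_mixture_logprob(logp_prev, logp_base, alpha):
    """Unnormalized log-probability of pi_ref proportional to
    pi_prev^alpha * pi_base^(1 - alpha)."""
    return alpha * logp_prev + (1.0 - alpha) * logp_base


def alpha_spin_loss(logp_theta_w, logp_theta_l,
                    logp_prev_w, logp_prev_l,
                    logp_base_w, logp_base_l,
                    alpha, beta=0.1):
    """Logistic loss on the log-ratio margin against the mixture reference.

    `beta` and the logistic form are assumptions made for this sketch.
    """
    ref_w = geometric_mixture_logprob(logp_prev_w, logp_base_w, alpha)
    ref_l = geometric_mixture_logprob(logp_prev_l, logp_base_l, alpha)
    # The log-normalizer of pi_ref is shared by y_w and y_l and cancels here.
    margin = beta * ((logp_theta_w - ref_w) - (logp_theta_l - ref_l))
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Example with made-up log-probabilities: the policy prefers y_w over y_l
# more strongly than the reference does, so the loss falls below log(2).
example_loss = alpha_spin_loss(-1.0, -5.0, -2.0, -2.0, -3.0, -3.0, alpha=0.5)
```

In this sketch, `alpha = 1` makes the reference equal to the previous iterate and `alpha = 0` makes it the base policy, which mirrors the interpolation between vanilla SPIN and DPO-like behavior described in the TL;DR.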

Paper Structure (22 sections, 14 equations, 6 figures, 1 table, 2 algorithms)

Introduction
Outline
Notations
Related work
Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Alignment methods based on alternative loss functions
Self-Play Fine-Tuning (SPIN)
Regularization of self-play fine-tuning
From KL regularization to geometric mixture
Fictitious play data generation
$\alpha$-SPIN algorithm
Performance evaluation
Experimental setup
Investigating the effect of KL regularization
...and 7 more sections

Figures (6)

Figure 1: Framework of regularized self-play fine-tuning (the $\alpha$-SPIN algorithm). The context/winner-answer pair is picked from the SFT dataset: $(x,y_w) \in \mathcal{D}_{\text{SFT}}$. The negative response $y_l$ is generated by either the previous policy $\pi_{\theta_{t-1}}$ or a mixture of the previous policies. The model $\pi_{\theta_{t}}$ is fine-tuned by maximum likelihood estimation, as in the DPO approach, against the reference model $\pi_{\text{ref}} \propto \pi_{\theta_{t-1}}^{\alpha} \times \pi_{\text{base}}^{1-\alpha}$ for a given $\alpha \in (0,1)$.
Figure 2: $\alpha$-SPIN ($h=2$) vs vanilla SPIN.
Figure 3: $\alpha$-SPIN ($h=2$) vs $\alpha$-SPIN ($h=1$).
Figure 4: Fictitious play effect on $\alpha$-SPIN.
Figure 5: Fictitious play effect on SPIN.
...and 1 more figure
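The negative-response generation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming a uniform mixture over the last `history_length` iterates (the $h$ of the figure legends) and treating each iterate as a plain callable from prompt to response. The helper names and the uniform weighting are hypothetical and not taken from the paper.

```python
import random


def sample_opponent_index(num_iterates, history_length, rng=random):
    """Pick the index of the past iterate that generates y_l.

    Assumption: uniform weights over the last `history_length` iterates.
    """
    if num_iterates < 1 or history_length < 1:
        raise ValueError("need at least one iterate and history_length >= 1")
    start = max(0, num_iterates - history_length)
    return rng.randrange(start, num_iterates)


def generate_negatives(prompts, iterates, history_length, seed=0):
    """Return (prompt, negative_response, iterate_index) for each prompt.

    `iterates` is a list of callables standing in for pi_theta_0, ...,
    pi_theta_{t-1}; each maps a prompt to a generated response.
    """
    rng = random.Random(seed)
    negatives = []
    for x in prompts:
        k = sample_opponent_index(len(iterates), history_length, rng)
        negatives.append((x, iterates[k](x), k))
    return negatives


# Toy stand-ins for four past policies.
toy_iterates = [lambda x, i=i: f"{x}-v{i}" for i in range(4)]
last_only = generate_negatives(["a", "b", "c"], toy_iterates, history_length=1)
last_two = generate_negatives(["a", "b", "c"], toy_iterates, history_length=2)
```

With `history_length=1` the opponent is always the previous iterate, as in vanilla SPIN. Setting `history_length` to the number of iterates draws from every past policy, which is one simple reading of the fictitious-play idea.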

Theorems & Definitions (1)

Remark 1: Official implementation of SPIN
