Preserving Diversity in Supervised Fine-Tuning of Large Language Models

Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, Ruoyu Sun

TL;DR

The paper critiques standard supervised fine-tuning (SFT) of large language models for over-emphasizing likelihood via the cross-entropy (CE) loss, which reduces output diversity and exacerbates forgetting. It introduces a game-theoretic framework with an auxiliary meta-controller, yielding GEM, a training algorithm that performs distribution matching with entropy regularization, approximating reverse KL minimization while preserving diversity. Empirical results on 3B–70B models show that GEM matches CE in downstream performance while delivering substantially higher output diversity, better test-time sampling efficiency, and a reduced alignment tax when fine-tuning on instruction-like data. The work suggests that maintaining diversity during SFT can enhance exploration in subsequent stages (e.g., RLHF, self-improvement) and mitigate catastrophic forgetting, with practical implications for scalable, diverse generation in LLMs.

Abstract

Large Language Models (LLMs) typically rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks, with the Cross Entropy (CE) loss being the de facto choice. However, CE maximizes the likelihood of observed data without accounting for alternative possibilities. As such, CE usually leads to reduced diversity in the model's outputs, which hinders further development that requires sampling to explore better responses. To address this limitation, this paper introduces a new game-theoretic formulation for SFT. In this framework, an auxiliary variable is introduced to regulate the learning process. We prove that the proposed game-theoretic approach connects to the problem of reverse KL minimization with entropy regularization. This regularization prevents over-memorization of training data and promotes output diversity. To implement this framework, we develop GEM, a new training algorithm that is as computationally efficient as CE by leveraging some unique properties of LLMs. Empirical studies of pre-trained models from 3B to 70B parameters show that GEM achieves comparable downstream performance to CE while significantly enhancing output diversity. This increased diversity translates to performance gains in test-time compute scaling for chat and code generation tasks. Moreover, we observe that preserving output diversity has the added benefit of mitigating forgetting, as maintaining diverse outputs encourages models to retain pre-trained knowledge throughout the training process.
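
To make "reverse KL minimization with entropy regularization" concrete, here is a minimal toy sketch, not the GEM algorithm itself. It assumes the objective has the form KL(f || p) − β·H(f) over a single categorical distribution, where p is the data distribution, f is the model, and β is a hypothetical regularization weight; the text above does not state this form or these names. Under that assumption the minimizer is p raised to the power 1/(1+β) and renormalized, so larger β yields a flatter, higher-entropy f.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def reverse_kl(f, p):
    """KL(f || p): the expectation is taken under the model f."""
    return sum(fi * math.log(fi / pi) for fi, pi in zip(f, p) if fi > 0)

def objective(f, p, beta):
    """Assumed toy objective: reverse KL minus beta times the entropy of f."""
    return reverse_kl(f, p) - beta * entropy(f)

def tempered(p, beta):
    """Closed-form minimizer of the assumed objective: p^(1/(1+beta)), renormalized."""
    powered = [pi ** (1.0 / (1.0 + beta)) for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

# Hypothetical data distribution over three tokens (illustrative numbers only).
data_dist = [0.7, 0.2, 0.1]

for beta in (0.0, 0.5, 2.0):
    f = tempered(data_dist, beta)
    print(f"beta={beta}: f={[round(x, 3) for x in f]}, entropy={entropy(f):.3f}")
```

With β = 0 the sketch recovers the data distribution exactly; with β > 0 probability mass moves from the dominant token toward the alternatives, which is the diversity-preserving effect the abstract describes.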

Paper Structure

This paper contains 30 sections, 3 theorems, 25 equations, 10 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

The gradient of CE specifies a logit flow map: each source token $j$ transfers $f_{\theta}(j|x)$ logits to the target token $i$. Formally, the update direction is a weighted sum of flow vectors $e_{i \leftarrow j}$, where $w_{i \leftarrow j}$ acts as the weighting factor and $e_{i \leftarrow j}$ is a vector whose $i$-th element is $1$, whose $j$-th element is $-1$, and whose remaining elements are $0$. Furthermore, the logit flow satisfies a conservation property, ensuring that logits red…
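
The logit-flow reading of the CE gradient can be checked numerically. The sketch below assumes a softmax model over a small hypothetical vocabulary and takes $w_{i \leftarrow j} = f_{\theta}(j|x)$, as the proposition's first sentence suggests. It builds the update as a sum of flows from each source token $j$ to the target $i$ and compares it with the standard closed form of the negative CE gradient with respect to the logits (one-hot target minus softmax probabilities). The function names and logit values are illustrative, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_logit_flow(logits, target):
    """Build the negative CE gradient w.r.t. logits as a sum of pairwise flows.

    Each non-target token j sends probs[j] units of logit to the target:
    +probs[j] at index `target`, -probs[j] at index j.
    """
    probs = softmax(logits)
    update = [0.0] * len(logits)
    for j, weight in enumerate(probs):
        if j == target:
            continue  # the flow from the target to itself is the zero vector
        update[target] += weight
        update[j] -= weight
    return update, probs

# Hypothetical logits for a 4-token vocabulary; token 2 is the target.
logits = [2.0, 0.5, -1.0, 0.0]
target = 2
update, probs = ce_logit_flow(logits, target)

# Closed form of the negative CE gradient: one_hot(target) - softmax(logits).
direct = [(1.0 if k == target else 0.0) - p for k, p in enumerate(probs)]

print("flow-based update:", [round(u, 4) for u in update])
print("closed-form update:", [round(d, 4) for d in direct])
print("sum of update entries:", round(sum(update), 12))
```

Because every flow vector has one $+1$ and one $-1$, the entries of the update sum to zero, which is one way to read the conservation property: whatever the target gains is taken from the other tokens.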

Figures (10)

  • Figure 1: Illustration of diversity preservation in SFT. While pre-trained LLMs produce diverse outputs, these often lack proper formatting. Standard SFT using CE improves readability but reduces diversity. We aim to maintain output diversity while enhancing the readability of LLMs' responses.
  • Figure 2: Comparison of learning schemes: CE vs. GEM ($\beta = 0$). The arrows illustrate the probability movement directions during the learning process, with Token 3 as the target token.
  • Figure 3: Illustration of the meta-controller $q$.
  • Figure 4: Enhancing output diversity boosts the win rate when using BoN.
  • Figure 5: Performance of test-time scaling. The results demonstrate that GEM achieves better performance with the same sampling budget and is more efficient in reaching comparable performance.
  • ...and 5 more figures
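
Why output diversity can help best-of-N (BoN) sampling, as in Figures 4 and 5, can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not the paper's evaluation: it models a policy as a categorical distribution over three candidate responses with made-up quality scores, assumes a perfect scorer that always picks the best of N i.i.d. samples, and computes the expected best score exactly. Both hypothetical policies have the same expected single-sample score; they differ only in how concentrated they are.

```python
def expected_best_of_n(policy, n):
    """Expected maximum score among n i.i.d. samples from `policy`.

    `policy` is a list of (score, probability) pairs. Uses
    P(max <= s) = P(score <= s) ** n, accumulated over sorted scores.
    """
    expected, cdf_prev, cumulative = 0.0, 0.0, 0.0
    for score, prob in sorted(policy):
        cumulative += prob
        cdf_now = cumulative ** n
        expected += score * (cdf_now - cdf_prev)  # P(max == score)
        cdf_prev = cdf_now
    return expected

# Hypothetical policies (scores and probabilities are illustrative only).
peaked = [(0.2, 0.05), (0.6, 0.90), (1.0, 0.05)]   # low diversity
diverse = [(0.2, 0.25), (0.6, 0.50), (1.0, 0.25)]  # high diversity

for n in (1, 4, 16):
    print(f"N={n}: peaked={expected_best_of_n(peaked, n):.3f}, "
          f"diverse={expected_best_of_n(diverse, n):.3f}")
```

In this toy setting the two policies tie at N = 1, and the more diverse one pulls ahead as N grows because it reaches the high-scoring response more often. Real results depend on the scorer's accuracy and on how diversity is distributed across good and bad responses.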

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proof of \ref{prop:main}