On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Tomasz Korbak; Hady Elsahar; Germán Kruszewski; Marc Dymetman

TL;DR

This work investigates how Reward Maximization (RM) and Distribution Matching (DM) approaches relate in fine-tuning language models, showing that KL-control in RM corresponds to a DM objective with an emergent energy-based target. It then transfers a variance-reduction technique, the baseline, from reinforcement learning to Distributional Policy Gradients (DPG), yielding GDC++, which better satisfies distributional constraints while preserving diversity and limiting catastrophic forgetting. Empirically, GDC++ outperforms standard GDC and the RM methods used for comparison across multiple constraint types (pointwise and distributional) and batch sizes, with reduced gradient variance and more stable training. The findings suggest that cross-pollinating RM and DM ideas can improve controllable language generation by balancing constraint satisfaction, fidelity to the original model, and sample efficiency, while opening avenues for further integration of advanced RL techniques in DM frameworks.
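The correspondence between KL-control and a DM objective can be illustrated with a standard identity for KL-regularized reward maximization. The derivation below is a sketch under stated assumptions: the symbols $a$ (pre-trained model), $R$ (reward), $\beta$ (KL coefficient) and the target form $P_z(x) = a(x)\, e^{R(x)/\beta}$ are introduced here for illustration and are not given in this summary.

```latex
% Assumed notation: a = pre-trained model, R = reward, beta > 0 = KL coefficient,
% P_z(x) = a(x) e^{R(x)/beta}, Z = sum_x P_z(x), p_z = P_z / Z.
\begin{align*}
J(\theta)
  &= \mathbb{E}_{x \sim \pi_\theta}\bigl[R(x)\bigr] - \beta\, D_{\mathrm{KL}}(\pi_\theta, a) \\
  &= -\beta\, \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(x)}{a(x)\, e^{R(x)/\beta}}\right] \\
  &= -\beta\, \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(x)}{p_z(x)}\right] + \beta \log Z \\
  &= -\beta\, D_{\mathrm{KL}}(\pi_\theta, p_z) + \beta \log Z .
\end{align*}
```

Since $Z$ does not depend on $\theta$, maximizing $J(\theta)$ under these assumptions is the same as minimizing $D_{\mathrm{KL}}(\pi_\theta, p_z)$, which is a distribution-matching objective whose target is the normalized energy-based model $p_z$.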

Abstract

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes first making explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of a baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability, and sample efficiency.
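The effect of a baseline on gradient variance can be seen in a small toy computation. The sketch below is not the paper's implementation: it assumes a softmax policy over four outcomes, an importance-weighted single-sample estimator of the form (P(x)/pi(x) - baseline) * grad log pi(x), and the partition function Z as the baseline. The scores in `P` and the helper `grad_stats` are hypothetical. Mean and variance are computed exactly by enumeration, so no sampling noise is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_stats(logits, P, baseline):
    """Exact mean and total variance of the single-sample estimator
    g(x) = (P(x) / pi(x) - baseline) * grad_theta log pi(x), with x ~ pi.

    The estimator form is an assumption made for this illustration.
    """
    pi = softmax(logits)
    k = len(pi)
    mean = [0.0] * k
    second_moment = 0.0
    for x in range(k):
        weight = P[x] / pi[x] - baseline
        # Score function of a softmax policy: one_hot(x) - pi.
        score = [(1.0 if j == x else 0.0) - pi[j] for j in range(k)]
        g = [weight * s for s in score]
        for j in range(k):
            mean[j] += pi[x] * g[j]
        second_moment += pi[x] * sum(v * v for v in g)
    # Total variance = E||g||^2 - ||E g||^2 (trace of the covariance).
    variance = second_moment - sum(v * v for v in mean)
    return mean, variance


P = [1.0, 2.0, 3.0, 2.0]       # hypothetical unnormalized target scores
logits = [0.0, 0.0, 0.0, 0.0]  # uniform toy policy
Z = sum(P)                     # partition function; equals E_pi[P(x) / pi(x)]

mean_plain, var_plain = grad_stats(logits, P, baseline=0.0)
mean_base, var_base = grad_stats(logits, P, baseline=Z)
print("no baseline:  ", mean_plain, var_plain)
print("with baseline:", mean_base, var_base)
```

In this toy setting both estimators have the same expected gradient, because the score function has zero mean under the policy, while the total variance drops from 52 to 4 once the baseline is subtracted. A baseline is not guaranteed to help this much in general; the numbers only illustrate the mechanism.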

Paper Structure (42 sections, 1 theorem, 37 equations, 11 figures, 14 tables, 1 algorithm)

Introduction
Reward Maximization
Reward Maximization with KL-Control
Distribution Matching
Background
Reward Maximization vs Distribution Matching
Standard vs. Parametric Rewards
KL-control as Distribution Matching
Similarities and Differences between DPG and Policy Gradients
A Case Study on Variance Reduction
Generation with Distributional Control
Experimental setup
Methods
Metrics
Training details
...and 27 more sections

Key Result

Theorem 1

Consider the following EBM: and let $p_z$ be the normalized distribution $p_z(x) = \frac{1}{Z}\; P_z(x)$, with $Z=\sum_x P_z(x)$. Then:

Figures (11)

Figure 1: In this study we make a connection between two popular paradigms for aligning language models to human preferences: Reward Maximization (RM) and Distribution Matching (DM).
Figure 2: Values of the reward, the advantage and the baseline for the first 1000 epochs of a pointwise constraint experiment.
Figure 3: Evaluation metrics: $D_{\mathrm{KL}}(p, \pi_{\theta})$ ($\downarrow$ better), $\mathbb{E}_{\pi_\theta} \phi(x)$ ($\uparrow$ better), $D_{\mathrm{KL}}(\pi_{\theta}, a)$ ($\downarrow$ better), Self-BLEU-5 ($\downarrow$ better), and Distinct-1 ($\uparrow$ better), aggregated over the 6 pointwise-constraint experiments (tasks 1-6) for policies obtained from GDC++, GDC, Ziegler and Reinforce. See Figure \ref{fig:distributional-compare-methods-metrics} for the aggregated distributional-constraint experiments. In the Appendix, Figures \ref{fig:pointwise-compare-methods-split1}-\ref{fig:distributional-compare-methods-split} and Table \ref{tab:all_experiments_results} give individual views and the final results of each run.
Figure 4: $\mathbb{E}_{\pi_\theta} \phi(x)$ or $\hat{\mu}$ per constraint ($\uparrow$ better) and $D_{\mathrm{KL}}(p, \pi_{\theta})$ ($\downarrow$ better) as a function of the number of samples, reported for task 1 (a) and task 8 (b). We report the number of samples (i.e. the number of epochs times the batch size) for a fair comparison of convergence speed. GDC++ is consistently superior across all batch sizes in terms of convergence and constraint satisfaction. The effect is more conspicuous with small batch sizes. Batch sizes 512 and 2048 are greyed out for clarity.
Figure 5: Comparison between GDC and GDC++ using a set of variance diagnosis metrics on pointwise-constraint and distributional-constraint experiments.
...and 6 more figures

Theorems & Definitions (3)

Theorem 1
proof
proof
