
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models

Yi-Lin Tuan, William Yang Wang

TL;DR

This paper analyzes the gradients of loss functions that simultaneously reward good examples and penalize bad ones in LMs. It finds that ExMATE is a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further improves both statistical and generative performance.

Abstract

Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM), which optimizes the probabilities of good examples, many studies have explored ways to also penalize bad examples in order to improve the quality of the output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and provide a unified recipe for LM optimization, we present a gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on the CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both statistical (5-7%) and generative (+18% win rate) performance.

Paper Structure (35 sections, 1 theorem, 21 equations, 7 figures, 1 table)


Key Result

Lemma 4.1

In LMs with softmax function for final prediction, the Gradient Difference is determined by (1) the softmax distribution difference $\|P_\theta(\cdot|x^+,y^+_{<t}) - P_\theta(\cdot|x^-,y^-_{<t})\|_p$ (we use it as the gradient difference in the rest of the paper) and (2) the sameness of target outpu
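The softmax distribution difference named in item (1) of Lemma 4.1 can be illustrated numerically. The following is a minimal sketch, not the authors' code: the function names, the toy logits, and the choice of $p$ are all assumptions made for illustration, and the logits stand in for the model outputs at one time step for the positive and the negative context.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distribution_difference(logits_pos, logits_neg, p=2):
    """p-norm of the difference between two next-token distributions.

    logits_pos / logits_neg are hypothetical logits produced for the
    positive context (x+, y+_{<t}) and the negative context (x-, y-_{<t}).
    """
    probs_pos = softmax(logits_pos)
    probs_neg = softmax(logits_neg)
    return sum(abs(a - b) ** p for a, b in zip(probs_pos, probs_neg)) ** (1.0 / p)

# Identical contexts give identical distributions, so the difference is zero.
same = distribution_difference([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

# Contexts that push probability mass onto different tokens give a large difference.
diff = distribution_difference([5.0, 0.0, 0.0], [0.0, 5.0, 0.0], p=1)
```

With $p=1$ the value is bounded above by 2, which is reached only when the two distributions have disjoint support.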

Figures (7)

  • Figure 1: (a) DPO, (b) Unlikelihood, and (c) ExMATE loss functions when taking only either $P_\theta(y^+|x^+)$ (positive examples) or $P_\theta(y^-|x^-)$ (negative examples) as the control variable. We plot DPO in the case of $P_{ref}(\cdot)=1$, $\beta=1$, with the other probability ($P_\theta(y^-|x^-)$ or $P_\theta(y^+|x^+)$) fixed at 0.1 for easy visualization. Their functional characteristics differ, making them suitable for different use cases.
  • Figure 2: The estimated gradients of DPO, Unlikelihood, and ExMATE for time steps $t$ where $y^+_t = y^-_t$.
  • Figure 3: (a) All models' information differences on CausalDialogue are nearly zero (<1e-26). (b) Information differences on Anthropic HH-RLHF are higher than on CausalDialogue. (c) All models' gradient differences on CausalDialogue for the first three time steps. All are small, especially for the first time step, randomly initialized models, and models with more parameters.
  • Figure 4: (a) Perplexity (log scale) and agility of MLE, DPO, Unlikelihood, and ExMATE on CausalDialogue. Unlikelihood improves agility, DPO degrades both, and ExMATE is preferred because it improves both. (b) Perplexity and agility of SFT (MLE), DPO, Unlikelihood, and ExMATE on Anthropic HH-RLHF. DPO achieves high agility at the cost of perplexity; ExMATE improves both metrics by small margins.
  • Figure 5: Pythia 6.9B fine-tuned with SFT, Unlikelihood (UL), ExMATE, or SFT+DPO.
  • ...and 2 more figures
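
The setting described in the Figure 1 caption for DPO ($P_{ref}(\cdot)=1$, $\beta=1$, the other probability fixed at 0.1) can be reproduced numerically. The sketch below assumes the standard DPO loss, $-\log\sigma\big(\beta[\log\frac{P_\theta(y^+|x^+)}{P_{ref}(y^+|x^+)} - \log\frac{P_\theta(y^-|x^-)}{P_{ref}(y^-|x^-)}]\big)$, applied to sequence-level probabilities; the function name, argument names, and sampled probability grid are hypothetical and not taken from the paper.

```python
import math

def dpo_loss(p_pos, p_neg, p_ref_pos=1.0, p_ref_neg=1.0, beta=1.0):
    """Standard DPO loss for one (positive, negative) pair of sequence probabilities.

    Defaults mirror the caption's visualization setting: P_ref = 1 and beta = 1.
    """
    margin = beta * (math.log(p_pos / p_ref_pos) - math.log(p_neg / p_ref_neg))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

grid = (0.1, 0.3, 0.5, 0.7, 0.9)

# Vary the positive-example probability while the negative one is fixed at 0.1.
pos_curve = [(p, dpo_loss(p, 0.1)) for p in grid]

# Vary the negative-example probability while the positive one is fixed at 0.1.
neg_curve = [(p, dpo_loss(0.1, p)) for p in grid]
```

Under these assumptions the loss reduces to $\log(1 + P_\theta(y^-|x^-)/P_\theta(y^+|x^+))$, so it falls as the positive probability rises and grows as the negative probability rises.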

Theorems & Definitions (3)

  • Definition 4.1
  • Definition 4.2
  • Lemma 4.1