EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling

Siyu Ren, Zhiyong Wu, Kenny Q. Zhu

TL;DR

The paper addresses misalignment between model and human language distributions when training autoregressive LMs by identifying weaknesses of forward cross-entropy. It introduces EMO, which optimizes a differentiable upper bound of Earth Mover Distance with a semantically informed transport cost, to balance precision and recall and improve train-test consistency. Empirical results show EMO outperforms MLE and strong baselines across open-ended generation and downstream NLU tasks, including efficient continual fine-tuning that yields notable gains with limited data. The findings suggest EMO as a practical, scalable calibration method for enhancing large pre-trained language models in diverse domains.

Abstract

Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distributions due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Due to the high complexity of direct computation, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE, we find that EMO demonstrates consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
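
The abstract mentions a feasible upper bound on the earth mover distance but does not state it here. As a minimal sketch of one standard way to obtain such a bound (not necessarily the paper's exact formulation): any valid transport plan costs at least as much as the optimal one, so the independent coupling $\gamma_{ij} = P_i Q_j$ gives $\mathrm{EMD}(P, Q) \le \sum_{i,j} P_i Q_j C_{ij}$. The toy embeddings, the cosine-based cost, and all function names below are assumptions made for illustration.

```python
import math


def cosine_cost_matrix(embeddings):
    """Return C with C[i][j] = 1 - cos(e_i, e_j).

    Assumption: cosine distance between token embeddings stands in for a
    "semantically informed" transport cost.
    """
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    n = len(embeddings)
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            cost[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return cost


def emd_upper_bound(p, q, cost):
    """Cost of the independent coupling gamma[i][j] = p[i] * q[j].

    This coupling has marginals p and q, so its cost can never be lower
    than the optimal (earth mover) transport cost.
    """
    return sum(
        p[i] * q[j] * cost[i][j]
        for i in range(len(p))
        for j in range(len(q))
    )


# Hypothetical 3-token vocabulary: tokens 0 and 1 are close, token 2 is far.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cost = cosine_cost_matrix(embeddings)

p = [1.0, 0.0, 0.0]        # one-hot reference distribution (token 0)
q_near = [0.6, 0.4, 0.0]   # leftover mass on the nearby token
q_far = [0.6, 0.0, 0.4]    # leftover mass on the distant token

bound_near = emd_upper_bound(p, q_near, cost)
bound_far = emd_upper_bound(p, q_far, cost)
print(bound_near, bound_far)
```

In this toy case both model distributions assign 0.6 to the reference token, so forward cross-entropy scores them identically, while the transport cost is smaller for the one whose remaining mass sits on a nearby token. Because the reference is one-hot, the independent coupling is the only valid plan, so the bound coincides with the exact distance here.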

Paper Structure

This paper contains 50 sections, 15 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Scaling law of EMO with respect to model scale and data size.
  • Figure 2: The average of token-level forward and reverse cross-entropy between the distribution $Q_{\theta}$ of GPT-2 fine-tuned with different objectives and that of GPT-Neo-1.3B on the validation set of three different datasets. The lower the value, the better the learned $Q_{\theta}$ balances precision and recall.
  • Figure 3: Auto-J pairwise response comparison results of LLaMa-7B/13B fine-tuned with MLE and EMO on 805 test instructions from AlpacaEval.
  • Figure 4: Auto-J pairwise response comparison results of LLaMa2-7B/13B fine-tuned with MLE and EMO on 805 test instructions from AlpacaEval.
  • Figure 5: PandaLM pairwise response comparison results of LLaMa-7B/13B fine-tuned with MLE and EMO on 805 test instructions from AlpacaEval.
  • ...and 1 more figure
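
The token-level forward and reverse cross-entropy named in the Figure 2 caption can be sketched as follows. This is an illustrative reading, assuming the reference model's next-token distribution plays the role of $P$ and the fine-tuned model's distribution is $Q_{\theta}$, so that forward means $-\sum_x P(x) \log Q_{\theta}(x)$ and reverse swaps the two. The function names, the clipping constant, and the toy distributions are hypothetical, and how the figure combines the two quantities is not specified here.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); q is clipped at eps to avoid log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_forward_reverse_ce(reference_dists, model_dists):
    """Average forward and reverse cross-entropy over token positions.

    reference_dists[t] and model_dists[t] are next-token distributions at
    position t (lists of probabilities over the same vocabulary).
    """
    n = len(reference_dists)
    forward = sum(cross_entropy(p, q) for p, q in zip(reference_dists, model_dists)) / n
    reverse = sum(cross_entropy(q, p) for p, q in zip(reference_dists, model_dists)) / n
    return forward, reverse


# Toy data: two positions over a 3-token vocabulary.
reference = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
matched = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]      # identical to the reference
peaked = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]   # over-concentrated model

fwd_matched, rev_matched = mean_forward_reverse_ce(reference, matched)
fwd_peaked, rev_peaked = mean_forward_reverse_ce(reference, peaked)
print(fwd_matched, rev_matched)
print(fwd_peaked, rev_peaked)
```

With the matched model, both directions reduce to the entropy of the reference distribution, which is the smallest forward value any model can reach. The over-concentrated model pays a higher forward cost because it assigns little probability to tokens the reference considers plausible.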