MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, Zaid Harchaoui

TL;DR

MAUVE is a principled, divergence-frontier-based metric for open-ended text generation. It compares the distribution of neural text to that of human text by embedding the text, quantizing the embeddings, and evaluating KL divergences along a family of mixtures of the two distributions. The area under the resulting divergence curve gives a single-score summary that captures both quality and coverage, correlates strongly with human judgments, and outperforms several traditional metrics. The method is stable across embedding and quantization choices, reflects differences in model size and decoding strategy, and is accompanied by an open-source implementation. This work offers a practical, domain-agnostic tool for evaluating modern text generators and leaves extensions to closed-ended tasks such as summarization and translation to future work.

Abstract

As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
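
To make the phrase "quantized embedding space" concrete, the following is a minimal sketch of turning two sets of embedding vectors into discrete distributions over a shared set of cells. It is an illustration only: the 2-D points stand in for real text embeddings, a uniform grid stands in for whatever clustering the actual method uses, and the names `quantize_to_grid` and `quantized_histograms` are hypothetical.

```python
import math
from collections import Counter

def quantize_to_grid(points, cell_size=1.0):
    """Map each 2-D point to the integer index of the grid cell that contains it."""
    return [(math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points]

def quantized_histograms(p_points, q_points, cell_size=1.0):
    """Return (support, p_hist, q_hist): cell frequencies of both samples on a shared support."""
    p_cells = Counter(quantize_to_grid(p_points, cell_size))
    q_cells = Counter(quantize_to_grid(q_points, cell_size))
    # Shared, sorted support so both histograms index the same cells in the same order.
    support = sorted(set(p_cells) | set(q_cells))
    p_hist = [p_cells[cell] / len(p_points) for cell in support]
    q_hist = [q_cells[cell] / len(q_points) for cell in support]
    return support, p_hist, q_hist

# Toy "embeddings" (assumed data, not from the paper).
human_points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.5), (1.2, 1.7)]
model_points = [(0.3, 0.3), (2.5, 0.1), (1.1, 1.1), (1.9, 1.2)]

support, p_hist, q_hist = quantized_histograms(human_points, model_points, cell_size=1.0)
print(support)
print(p_hist)
print(q_hist)
```

Once both samples are expressed as histograms over the same cells, information divergences between them reduce to finite sums, which is what makes the comparison tractable.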

Paper Structure

This paper contains 36 sections, 2 theorems, 20 equations, 10 figures, 14 tables, 1 algorithm.

Key Result

Proposition 1

Consider two distributions $P, Q$ with finite support and a scaling constant $c > 0$. Let $R_\lambda$ be such that $(e^{-c\, {\mathrm{KL}}(Q|R_\lambda)}, e^{-c\, {\mathrm{KL}}(P|R_\lambda)}) \in \mathcal{C}(P, Q)$. Then, $R_\lambda$ is Pareto-optimal for the pair of objectives $({\mathrm{KL}}(Q|\cdo

Figures (10)

  • Figure 1: Left: Mauve compares the machine text distribution $Q$ to that of human text $P$ by using the family of mixtures $R_\lambda = \lambda P + (1-\lambda) Q$ for $\lambda \in (0, 1)$. Right: Illustration of Type I errors, where $Q$ produces degenerate, repetitive text which is unlikely under $P$, and Type II errors, where $Q$ cannot produce plausible human text due to truncation heuristics (Holtzman et al., 2019). Mauve measures these errors softly, by using the mixture distribution $R_\lambda$. Varying $\lambda$ in $(0, 1)$ gives a divergence curve and captures a spectrum of soft Type I and Type II errors. Mauve summarizes the entire divergence curve in a single scalar as the area under this curve.
  • Figure 2: Divergence curves for different models (GPT-2 (Radford et al., 2019), Grover (Zellers et al., 2019)) and decoding algorithms (greedy decoding, ancestral sampling, and nucleus sampling). Mauve is computed as the area of the shaded region, and larger values of Mauve indicate that $Q$ is closer to $P$. In general, Mauve indicates that generations from larger models and nucleus sampling are closer to human text. Rightmost: Nucleus sampling has a slightly smaller Type I error than ancestral sampling but a higher Type II error, indicating that ancestral sampling with Grover base produces more degenerate text while nucleus sampling does not effectively cover the human text distribution.
  • Figure 3: Illustration of the quantization. Left: A continuous two-dimensional distribution $P$. Right: A partitioning of the Euclidean plane $\mathbb{R}^2$ and the corresponding quantized distribution $\tilde{P}$.
  • Figure 4: Generation quality versus maximum generation length according to Mauve and three alternative measures (web text, GPT-2). Mauve is the only comparison measure which identifies that generation quality decreases monotonically with increasing text length. The shaded area shows one standard deviation over generations from 5 random seeds.
  • Figure 5: Left: Mauve computed using GPT-2 (default) and RoBERTa (Liu et al., 2019) embeddings, across model sizes and decoding algorithms; see the corresponding table in the Appendix for further results. The Spearman rank correlation between the two is 0.993 across model sizes and decoding algorithms. Right: Effect of the scaling constant $c$ on Mauve. The choice of $c$ does not affect the relative order of the curves but only the numerical value. We use $c=5$ to get interpretable values with both nucleus and greedy decoding.
  • ...and 5 more figures
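
The divergence curve described in the Figure 1 and Figure 2 captions can be sketched for two discrete distributions as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the grid of $\lambda$ values is uniform, the curve is closed with the points $(0, 1)$ and $(1, 0)$, and the area is taken with the trapezoid rule. The default $c = 5$ follows the Figure 5 caption.

```python
import math

def kl(p, q):
    """KL(p | q) for two discrete distributions over the same support."""
    # Terms with p_i == 0 contribute nothing; q_i > 0 wherever p_i > 0 is assumed.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def divergence_curve(p, q, c=5.0, num_lambdas=99):
    """Points (exp(-c KL(Q|R)), exp(-c KL(P|R))) for mixtures R = lam*P + (1-lam)*Q."""
    curve = []
    for i in range(1, num_lambdas + 1):
        lam = i / (num_lambdas + 1)  # lam stays strictly inside (0, 1)
        r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
        curve.append((math.exp(-c * kl(q, r)), math.exp(-c * kl(p, r))))
    return curve

def area_under_curve(curve):
    """Trapezoid-rule area; the closing points (0, 1) and (1, 0) are an assumption."""
    # The first coordinate shrinks as lam grows, so reverse to get increasing x.
    path = [(0.0, 1.0)] + curve[::-1] + [(1.0, 0.0)]
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(path, path[1:]))

def mauve_like_score(p, q, c=5.0):
    """Hypothetical helper: area under the divergence curve of p and q."""
    return area_under_curve(divergence_curve(p, q, c=c))

# Toy quantized distributions (assumed data, not from the paper).
human = [0.5, 0.3, 0.2]
close_model = [0.45, 0.35, 0.2]
far_model = [0.05, 0.05, 0.9]

score_same = mauve_like_score(human, human)
score_close = mauve_like_score(human, close_model)
score_far = mauve_like_score(human, far_model)
print(score_same, score_close, score_far)
```

In this sketch, identical distributions give an area of essentially 1, and the area shrinks as $Q$ moves away from $P$, matching the caption's statement that larger values indicate $Q$ is closer to $P$.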

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Lemma 2
  • proof