MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, Zaid Harchaoui
TL;DR
MAUVE is a divergence-frontier-based metric for open-ended text generation. It compares the distribution of neural text to that of human text by embedding the texts, quantizing the embeddings, and evaluating KL divergences along a family of mixtures of the two quantized distributions. The resulting divergence curve, summarized by the area under it, gives a single score that captures both quality and coverage, correlates strongly with human judgments, and outperforms several traditional metrics. The score is stable across embedding and quantization choices, reflects the known effects of model size and decoding strategy, and is accompanied by an open-source implementation. The work offers a practical, domain-agnostic tool for evaluating modern text generators and suggests extensions to closed-ended tasks such as summarization and translation as future work.
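For reference, the divergence curve and its area can be written out as follows. This is a sketch in the notation commonly used for divergence frontiers: P is taken to be the model's text distribution, Q the human text distribution, and c > 0 a scaling constant; the symbols are assumptions introduced here, since the summary above does not fix a notation.

```latex
\begin{align*}
R_\lambda &= \lambda P + (1-\lambda)\, Q, \qquad \lambda \in (0,1), \\
\mathcal{C}(P, Q) &= \Big\{ \big( \exp(-c\, \mathrm{KL}(Q \,\|\, R_\lambda)),\;
                                \exp(-c\, \mathrm{KL}(P \,\|\, R_\lambda)) \big)
                      : \lambda \in (0,1) \Big\}, \\
\mathrm{MAUVE}(P, Q) &= \text{area under the curve } \mathcal{C}(P, Q).
\end{align*}
```

Because each R_\lambda puts mass wherever P or Q does, both KL terms stay finite even when P and Q have different supports. In practice, P and Q are replaced by their quantized counterparts in embedding space.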
Abstract
As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
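To make the "information divergences in a quantized embedding space" step concrete, here is a minimal sketch of the final computation, assuming the texts have already been embedded and assigned to clusters, so that each corpus is reduced to a histogram over cluster ids. It is not the authors' released implementation: the function names (`histogram`, `divergence_curve`, `mauve_area`), the scaling constant `c=5.0`, and the number of mixture weights are illustrative assumptions.

```python
import numpy as np

def histogram(labels, k):
    """Turn cluster ids (ints in [0, k)) into a normalized histogram."""
    counts = np.bincount(np.asarray(labels), minlength=k).astype(float)
    return counts / counts.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions; zero-mass terms of p are skipped."""
    mask = p > 0
    value = np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return max(float(value), 0.0)  # clip tiny negative round-off

def divergence_curve(p, q, c=5.0, num_points=25):
    """Points (exp(-c KL(q||r)), exp(-c KL(p||r))) for mixtures r of p and q."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    # Mixture weights strictly inside (0, 1), so r > 0 wherever p or q is.
    lambdas = np.linspace(0.0, 1.0, num_points + 2)[1:-1]
    points = []
    for lam in lambdas:
        r = lam * p + (1.0 - lam) * q
        points.append((np.exp(-c * kl(q, r)), np.exp(-c * kl(p, r))))
    return np.array(points)

def mauve_area(p, q, c=5.0, num_points=25):
    """Area under the divergence curve (assumed summary: trapezoid rule)."""
    curve = divergence_curve(p, q, c=c, num_points=num_points)
    # Close the curve with the extreme points (0, 1) and (1, 0).
    curve = np.vstack([[0.0, 1.0], curve, [1.0, 0.0]])
    # Sort by x ascending, breaking ties by y descending.
    order = np.lexsort((-curve[:, 1], curve[:, 0]))
    x, y = curve[order, 0], curve[order, 1]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))

# Hypothetical cluster assignments for model text and human text (k = 4).
model_hist = histogram([0, 0, 1, 1, 2, 3, 3, 3], k=4)
human_hist = histogram([0, 1, 1, 2, 2, 2, 3, 3], k=4)
score = mauve_area(model_hist, human_hist)
```

In this sketch, identical histograms give a score close to 1, and histograms with disjoint support give a score close to 0. In a full pipeline the cluster ids would typically come from running a clustering method such as k-means on embeddings from a pretrained language model; that step is omitted here.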
