Open-Domain Text Evaluation via Contrastive Distribution Methods
Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, Nanyun Peng
TL;DR
This work introduces Contrastive Distribution Methods (CDM) for open-domain text evaluation, framing model quality as an oracle function $E(p)$ and leveraging a partial order across model sizes to contrast the distributions of two models. It develops two evaluation paradigms: Generative CDM, which synthesizes challenging negative samples from a degraded distribution to train a discriminator, and Discriminative CDM, which directly aggregates the step-wise contrastive momentum between an amateur and an expert model into a quality score. The authors demonstrate that CDM correlates more strongly with human judgments than strong baselines on multi-turn dialogue coherence and commonsense generation tasks, including a CommonsGen-trinity evaluation where CDM achieves state-of-the-art results. Overall, CDM provides a scalable, reference-free, distribution-focused framework for evaluating open-domain generation, with practical impact for model development and benchmarking.
Abstract
Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models' generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the _contrast_ of two probabilistic distributions -- one known to be superior to the other -- to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) _Generative_ CDM, which harnesses the contrast of two language models' distributions to generate synthetic examples for training discriminator-based metrics; 2) _Discriminative_ CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate that CDM correlates better with human judgment than existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.
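To make the Discriminative CDM idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-token log-probabilities of the same text have already been obtained from a larger "expert" model and a smaller "amateur" model, and it aggregates their step-wise differences into a single score. The function name `contrastive_score`, the `normalize` flag, and the choice of a plain sum or mean as the aggregation are all illustrative assumptions; the abstract does not specify the exact aggregation.

```python
from typing import Sequence


def contrastive_score(
    expert_logprobs: Sequence[float],
    amateur_logprobs: Sequence[float],
    normalize: bool = True,
) -> float:
    """Aggregate step-wise log-probability differences (expert minus amateur).

    Hypothetical helper: the sum/mean aggregation is an assumption made for
    illustration, not the formula from the paper.
    """
    if len(expert_logprobs) != len(amateur_logprobs):
        raise ValueError("both models must score the same token sequence")
    if len(expert_logprobs) == 0:
        raise ValueError("cannot score an empty sequence")

    # Step-wise contrast: how much more the expert favors each token.
    diffs = [e - a for e, a in zip(expert_logprobs, amateur_logprobs)]
    total = sum(diffs)

    # Optionally length-normalize so long texts are not favored by length alone.
    return total / len(diffs) if normalize else total


# Made-up log-probabilities for a three-token text (illustrative values only).
expert = [-1.0, -0.5, -2.0]
amateur = [-1.5, -1.5, -2.5]
raw_score = contrastive_score(expert, amateur, normalize=False)
mean_score = contrastive_score(expert, amateur)
```

Under these assumptions, a higher score means the expert model assigns the text relatively more probability than the amateur does, which the sketch treats as a proxy for quality.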
