Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, Nanyun Peng

TL;DR

This work introduces Contrastive Distribution Methods (CDM) for open-domain text evaluation, framing model quality as an oracle function $E(p)$ and leveraging a partial order over model sizes to contrast the distributions of two models. It develops two evaluation paradigms: Generative CDM, which synthesizes challenging negative samples from a degraded distribution to train a discriminator, and Discriminative CDM, which directly aggregates the step-wise contrastive momentum between an amateur and an expert model into a quality score. The authors demonstrate that CDM correlates more strongly with human judgments than strong baselines on multi-turn dialogue coherence and commonsense generation tasks, including a CommonsGen-trinity evaluation where CDM achieves state-of-the-art results. Overall, CDM provides a scalable, reference-free, distribution-focused framework for evaluating open-domain generation, with practical impact for model development and benchmarking.

Abstract

Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models' generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the _contrast_ of two probabilistic distributions -- one known to be superior to the other -- to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) _Generative_ CDM, which harnesses the contrast of two language models' distributions to generate synthetic examples for training discriminator-based metrics; 2) _Discriminative_ CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate that CDM correlates better with human judgment than existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.

Paper Structure

This paper contains 35 sections, 4 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: Conceptual illustration of the Contrastive Distribution Methods (CDM). (a) Generative CDM generates negative examples for training a discriminator-based metric. (b) Discriminative CDM directly evaluates the distribution/sequence by contrasting the step-wise likelihood scores.
  • Figure 2: (a) While it is hard to assume a total order for models from different model classes under the oracle metric $E(p)$, it is plausible to assume partial orders for models from the same model class. (b) Generative CDM uses the degraded distribution $p_n$ to synthesize fake samples for training a discriminator as the metric. The warm/cold region indicates the decision boundary of the resulting trainable metric induced by fake samples from $p_n$. (c) Discriminative CDM directly determines the decision boundary by pooling the values of the step-wise contrastive momentum.
  • Figure 3: A more detailed illustration of the two Contrastive Distribution Methods (CDM). (a) Generative CDM constructs fake negative samples from positive ones for training a discriminator-based metric. (b) Discriminative CDM directly evaluates the distribution/sequence by contrasting and aggregating the step-wise likelihood scores.
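
To make the Discriminative CDM idea in the captions above more concrete, here is a minimal sketch of contrasting and aggregating step-wise likelihood scores. It is an illustration under stated assumptions, not the authors' implementation: the step-wise contrast is assumed to be the difference between the expert's and the amateur's log-probability of each observed token, and the aggregation is assumed to be a simple mean. The function names (`log_softmax`, `stepwise_contrast`, `cdm_score`) and the toy logits are hypothetical; in practice the logits would come from two language models of the same family but different sizes.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def stepwise_contrast(
    expert_logits: Sequence[Sequence[float]],
    amateur_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> List[float]:
    """Per-step contrast: log p_expert(x_t) - log p_amateur(x_t).

    ASSUMPTION: this log-probability difference stands in for the
    step-wise contrastive score; the paper's exact definition may differ.
    """
    if not (len(expert_logits) == len(amateur_logits) == len(token_ids)):
        raise ValueError("all inputs must cover the same number of steps")
    scores = []
    for e_row, a_row, tok in zip(expert_logits, amateur_logits, token_ids):
        scores.append(log_softmax(e_row)[tok] - log_softmax(a_row)[tok])
    return scores


def cdm_score(
    expert_logits: Sequence[Sequence[float]],
    amateur_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> float:
    """Aggregate the step-wise contrasts into one sequence-level score.

    ASSUMPTION: mean pooling; other pooling choices are equally plausible.
    """
    steps = stepwise_contrast(expert_logits, amateur_logits, token_ids)
    if not steps:
        raise ValueError("cannot score an empty sequence")
    return sum(steps) / len(steps)


if __name__ == "__main__":
    # Toy setting: vocabulary of 3 tokens, sequence of 2 steps.
    expert = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # confident "expert"
    amateur = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform "amateur"
    print(cdm_score(expert, amateur, [0, 1]))  # tokens the expert prefers
    print(cdm_score(expert, amateur, [1, 0]))  # tokens the expert disprefers
```

Under these assumptions, a sequence whose tokens the expert favors more than the amateur receives a positive score, one the expert disfavors receives a negative score, and identical models yield a score of zero.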