Jointly Measuring Diversity and Quality in Text Generation Models

Ehsan Montahaei, Danial Alihosseini, Mahdieh Soleymani Baghshah

TL;DR

The paper tackles the evaluation gap in text generation by proposing joint quality-diversity metrics. It introduces MS-Jaccard (n-gram distribution similarity), Fréchet BERT Distance (FBD) in BERT feature space, and a symmetric Bhattacharyya distance for oracle-based evaluation to simultaneously assess the realism and diversity of generated text. Through experiments on COCO Captions, EMNLP2017 WMT News, IMDB, and a synthetic oracle, the authors show that these metrics correlate well and often favor MLE over GAN-based methods in terms of coverage and quality. The work provides practical, distribution-level tools for more robust NLG evaluation and highlights the value of considering the full quality-diversity spectrum when comparing generators.

Abstract

Text generation is an important Natural Language Processing task with various applications. Although several metrics have already been introduced to evaluate text generation methods, each of them has its own shortcomings. The most widely used metrics, such as BLEU, only consider the quality of generated sentences and neglect their diversity. For example, repeated generation of a single high-quality sentence would result in a high BLEU score. On the other hand, Self-BLEU, a more recent metric introduced to evaluate the diversity of generated texts, ignores their quality. In this paper, we propose metrics that evaluate quality and diversity simultaneously by approximating the distance between the learned generative model and the real data distribution. For this purpose, we first introduce a metric that approximates this distance using n-gram based measures. Then, we introduce a feature-based measure built on BERT, a recent very deep model trained on a large text corpus. Finally, for the oracle training mode, in which the generator's density can also be calculated, we propose to use distance measures between the corresponding explicit distributions. Lastly, the most popular and recent text generation models are evaluated using both the existing and the proposed metrics, and the preferences of the proposed metrics are determined.
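To make the n-gram based idea concrete, here is a minimal sketch of a multiset Jaccard similarity between the n-grams of a generated set and a reference set. It is an illustration only, not the authors' implementation: the function names (`ngram_counts`, `multiset_jaccard`), the normalization by number of sentences, and the geometric mean over n-gram orders are assumptions made for this example, and the paper's exact MS-Jaccard definition may differ in its details.

```python
from collections import Counter
from math import exp, log


def ngram_counts(sentences, n):
    """Average per-sentence count of each n-gram (assumed normalization)."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = max(len(sentences), 1)
    return {gram: c / total for gram, c in counts.items()}


def multiset_jaccard(generated, reference, max_n=2):
    """Geometric mean over n = 1..max_n of sum(min) / sum(max) of n-gram counts."""
    scores = []
    for n in range(1, max_n + 1):
        a = ngram_counts(generated, n)
        b = ngram_counts(reference, n)
        grams = set(a) | set(b)
        inter = sum(min(a.get(g, 0.0), b.get(g, 0.0)) for g in grams)
        union = sum(max(a.get(g, 0.0), b.get(g, 0.0)) for g in grams)
        scores.append(inter / union if union > 0 else 0.0)
    if min(scores) == 0.0:
        return 0.0  # avoid log(0); any zero factor makes the geometric mean zero
    return exp(sum(log(s) for s in scores) / len(scores))


# Toy data: a "collapsed" generator repeats one good sentence.
reference = [["a", "b"], ["c", "d"]]
collapsed = [["a", "b"], ["a", "b"]]

identical_score = multiset_jaccard(reference, reference)
collapsed_score = multiset_jaccard(collapsed, reference)
```

In this toy setup the collapsed generator scores lower than a generator that reproduces the reference distribution, even though every sentence it emits is a valid reference sentence. That is the behavior a quality-only metric such as BLEU would miss.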

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Diversity vs. quality measures of various models with temperatures from $1.5^{-3}$ to $1.5^4$ on different datasets. Each point in the plot corresponds to the performance of a model at a specific temperature (a second-degree polynomial has been fitted to the points). Lower values on both axes are better.
  • Figure 2: NLL, $1-$MS-Jaccard4, and FBD scores of all the models without applying temperature (i.e. $T=1$) on different datasets. Lower values show better performance.
  • Figure 3: The performance of all models (without applying temperature, i.e. $T=1$) on the Oracle dataset using different measures. Lower values show better performance.
  • Figure 4: Pearson correlation of all metrics when aggregating results over the real-world text datasets and all temperatures.
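
The temperature sweep mentioned in the Figure 1 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: the caption gives only the endpoints $1.5^{-3}$ and $1.5^4$, so the integer exponent steps are an assumption, as are the helper names and the choice to apply temperature by dividing logits before the softmax.

```python
import math


def temperature_grid(base=1.5, low=-3, high=4):
    """Temperatures base**k for integer k in [low, high] (step size is assumed)."""
    return [base ** k for k in range(low, high + 1)]


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats, used here as a rough proxy for diversity."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, for illustration only.
logits = [2.0, 1.0, 0.0]
grid = temperature_grid()
entropies = [entropy(softmax_with_temperature(logits, t)) for t in grid]
```

Low temperatures concentrate probability on the most likely token (higher quality, lower diversity), while high temperatures flatten the distribution. Sweeping the grid therefore traces out one quality-diversity curve per model, which is what each fitted curve in Figure 1 summarizes.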