Eval all, trust a few, do wrong to none: Comparing sentence generation models

Ondřej Cífka, Aliaksei Severyn, Enrique Alfonseca, Katja Filippova

TL;DR

This paper addresses the lack of standardized evaluation for neural text generation by rigorously comparing plain, variational, and adversarial autoencoders on reconstruction and sampling tasks. It introduces a comprehensive metric suite, including a novel Fréchet InferSent Distance, and evaluates both generated samples and reconstructions with automatic and human judgments. The results reveal trade-offs among model families, with VAEs offering strong sampling yet weaker reconstruction and regularized variants balancing the two; simple regularization often matches or surpasses more complex adversarial approaches. The work proposes an evaluation standard that can guide future research and benchmarking in neural text generation.

Abstract

In this paper, we study recent neural generative models for text generation related to variational autoencoders. Previous works have employed various techniques to control the prior distribution of the latent codes in these models, which is important for sampling performance, but little attention has been paid to reconstruction error. In our study, we follow a rigorous evaluation protocol using a large set of previously used and novel automatic and human evaluation metrics, applied to both generated samples and reconstructions. We hope that it will become the new evaluation standard when comparing neural generative models for text.

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: t-SNE visualization of 100 different encodings (samples from the posterior distribution) of 10 sentences, using various models.
  • Figure 2: Random samples from different models.
  • Figure 3: Sentences generated by interpolating between the encodings of "A Doncaster man suffered life threatening injuries after a collision." and "Barcelona will appeal the transfer ban."
  • Figure 4: Sentences generated by interpolating between the encodings of "A woman is accused of leaving her children at home." and "The Union Buildings have been declared a national heritage site."