Eval all, trust a few, do wrong to none: Comparing sentence generation models
Ondřej Cífka, Aliaksei Severyn, Enrique Alfonseca, Katja Filippova
TL;DR
This paper addresses the lack of standardized evaluation for neural text generation by rigorously comparing plain, variational, and adversarial autoencoders on reconstruction and sampling tasks. It introduces a comprehensive metric suite, including a novel Fréchet InferSent Distance, and evaluates both generated samples and reconstructions with automatic and human judgments. The results reveal trade-offs among model families: VAEs offer strong sampling but weaker reconstruction, while regularized variants balance the two. Simple regularization often matches or surpasses more complex adversarial approaches. The work proposes an evaluation standard to guide future research and benchmarking in neural text generation.
Abstract
In this paper, we study recent neural generative models for text generation related to variational autoencoders. Previous work has employed various techniques to control the prior distribution of the latent codes in these models, which is important for sampling performance, but has paid little attention to reconstruction error. In our study, we follow a rigorous evaluation protocol using a large set of previously used and novel automatic and human evaluation metrics, applied to both generated samples and reconstructions. We hope that this protocol will become the new evaluation standard when comparing neural generative models for text.
