Adversarial Evaluation of Dialogue Models
Anjuli Kannan, Oriol Vinyals
TL;DR
This paper explores an adversarial, discriminator-based approach as a proxy for human evaluation of dialogue models, testing whether a discriminator can separate machine-generated responses from human ones. Using a production Smart Reply generator, the study finds discriminator accuracy of around 62.5%, and its analyses reveal length bias and limited diversity as key weaknesses. However, fooling the discriminator does not reliably indicate higher human-perceived quality, so more work is needed to make adversarial evaluation practical. The paper discusses the benefits, limitations, and directions for future research in adversarial dialogue evaluation.
Abstract
The recent application of RNN encoder-decoder models has resulted in substantial progress in fully data-driven dialogue systems, but evaluation remains a challenge. An adversarial loss could be a way to directly evaluate the extent to which generated dialogue responses sound like they came from a human. This could reduce the need for human evaluation, while more directly evaluating on a generative task. In this work, we investigate this idea by training an RNN to discriminate a dialogue model's samples from human-generated samples. Although we find some evidence this setup could be viable, we also note that many issues remain in its practical application. We discuss both aspects and conclude that future work is warranted.
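As a rough illustration of the quantity being measured, the sketch below computes discriminator accuracy on a balanced set of human and generated responses, where 50% corresponds to a discriminator that cannot tell the two apart. It is not the paper's implementation: the function name `discriminator_accuracy`, the 0.5 decision threshold, and the toy scores are all assumptions made for illustration, and the toy data is invented, not taken from the paper's experiments.

```python
def discriminator_accuracy(scores, labels, threshold=0.5):
    """Return the fraction of responses the discriminator labels correctly.

    scores: hypothetical discriminator outputs, read as P(response is human).
    labels: 1 for a human response, 0 for a generated one.
    threshold: assumed decision boundary; not specified by the source.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(scores)


# Invented toy data: four human responses followed by four generated ones.
toy_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.8, 0.55]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

accuracy = discriminator_accuracy(toy_scores, toy_labels)
print(f"discriminator accuracy: {accuracy:.3f} (chance on balanced data: 0.500)")
```

Under this reading, an accuracy close to 0.5 would mean the generator's samples are hard to distinguish from human responses, while a higher value means the discriminator has found some systematic difference.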
