TuringAdvice: A Generative and Dynamic Evaluation of Language Use

Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi

TL;DR

This work introduces TuringAdvice, a framework and dynamic dataset (RedditAdvice) for evaluating language understanding through open-ended advice-giving tasks. By tying evaluation to human utility rather than static correctness, it reveals large gaps between state-of-the-art models and human performance, even when models are fine-tuned on extensive in-domain data. The authors implement a dynamic leaderboard and a hybrid Mechanical Turk workflow that compares model-generated advice against Reddit-endorsed human advice, uncovering systematic failures such as contradictions and toxic outputs. The study highlights the need for diagnostic measures and real-world, context-aware evaluation to drive progress toward grounded natural language understanding, and it points toward safer deployment in real-world advisory settings.

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

Paper Structure

This paper contains 39 sections, 3 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice. We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models -- which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning -- struggle with all of these capabilities when generating advice, as highlighted in red.
  • Figure 2: Crowdsourcing workflow. Mechanical Turk workers are given a situation and two pieces of advice. First, they choose which is more helpful (here, B). Second, they rate the helpfulness of the worse advice (A). Last, they answer a diagnostic question.
  • Figure 3: Helpfulness of models relative to top-scoring Reddit advice. We show results over 200 shared situations, with bootstrapped 95% confidence intervals. Advice from the best-scoring model, T5-11B, is preferred over top-scoring Reddit advice 14.5% of the time. We also compare the second-highest-scoring piece of Reddit advice, which scores 41% -- worse than the best advice (50% by definition), but better than any model.
  • Figure 4: Improvement (in absolute percentage points) between pairs of models, along with statistical significance from a paired t-test. The improvement of T5-11B over smaller models like Grover-Mega is highly statistically significant (10% gap, $p < .01$), though T5-11B remains far worse than human performance. Our evaluation thus meaningfully grades varying levels of performance.
  • Figure 5: Qualitative example; more in the supplementary material. Though machine-generated advice matches keywords from the situation, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.
  • ...and 14 more figures
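
The Figure 3 caption reports a preference rate over 200 shared situations together with bootstrapped 95% confidence intervals. The page does not say how the bootstrap was run, so the following is only a minimal sketch of one common choice, a percentile bootstrap over per-situation outcomes. The function name `bootstrap_ci`, the resample count, the seed, and the synthetic `outcomes` list (29 of 200 preferred, matching the 14.5% figure) are all assumptions made for illustration, not the authors' code or data.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a preference rate.

    outcomes: list of 0/1 values, one per situation (1 = model advice preferred).
    Returns (observed_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample situations with replacement and recompute the rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical data: 29 of 200 situations (14.5%) where model advice was preferred.
outcomes = [1] * 29 + [0] * 171
rate, lower, upper = bootstrap_ci(outcomes)
print(f"preference rate = {rate:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling whole situations rather than individual annotator votes treats each situation as the unit of analysis; whether that matches the paper's procedure is not stated on this page.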