What Question Answering can Learn from Trivia Nerds

Jordan Boyd-Graber, Benjamin Börschinger

TL;DR

The paper reframes QA evaluation as a trivia tournament problem, arguing that long-standing trivia practices (playtesting, unambiguous question writing, discriminative scoring, and transparent adjudication) can address flaws in current QA leaderboards. It advocates adopting Quizbowl-inspired properties (interruptibility, pyramidal clue structure, and rigorous editing) to improve discriminability and reduce annotation and evaluation biases. By introducing concepts like the effective dataset proportion $\rho$ and emphasizing multiple metrics and stakeholder perspectives, the authors propose practical steps (collaboration with trivia experts, self-play testing, data transparency, and data-driven adjudication) to build more faithful, robust QA benchmarks. The work highlights the potential impact on real-world QA systems: fairer leaderboards, richer error analysis, and a healthier data ecosystem that bridges academic QA and the trivia community. In the authors' simulation framework, $\rho$ and $N$ appear as key design variables for quantifying how dataset discriminativeness influences the test-set size required for reliable comparisons.

Abstract

In addition to the traditional task of getting machines to answer questions, a major research question in question answering is to create interesting, challenging questions that can help systems learn how to answer questions and also reveal which systems are the best at answering questions. We argue that creating a question answering dataset -- and the ubiquitous leaderboard that goes with it -- closely resembles running a trivia tournament: you write questions, have agents (either humans or machines) answer the questions, and declare a winner. However, the research community has ignored the hard-learned lessons from decades of the trivia community creating vibrant, fair, and effective question answering competitions. After detailing problems with existing QA datasets, we outline the key lessons -- removing ambiguity, discriminating skill, and adjudicating disputes -- that can transfer to QA research and how they might be implemented for the QA community.

Paper Structure

This paper contains 32 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: Two datasets with $0.16$ annotation error, but the top better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decide who sits atop the leaderboard, requiring a larger test set.
  • Figure 2: How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems ($x$ axis) and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you'll need 15000.
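
The trade-off in the Figure 2 caption can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' simulation and is not expected to reproduce the 2500 and 15000 figures. It assumes a textbook two-proportion z-test sample-size approximation, assumes the two systems differ only on the discriminative fraction $\rho$ of questions (with the accuracy gap measured on those questions), and adds an 80% power target that the caption does not mention. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def required_test_size(mean_acc, gap, rho, confidence=0.95, power=0.8):
    """Approximate total test-set size needed to tell two systems apart.

    mean_acc   -- average accuracy of the two systems on discriminative questions
    gap        -- accuracy difference between them on discriminative questions
    rho        -- fraction of the test set that is discriminative (assumption:
                  the remaining questions carry no signal about which is better)
    confidence -- two-sided confidence level of the test
    power      -- probability of detecting the gap (an added assumption)
    """
    if not 0 < rho <= 1:
        raise ValueError("rho must be in (0, 1]")
    p_a = mean_acc + gap / 2
    p_b = mean_acc - gap / 2
    if not 0 < p_b < p_a < 1:
        raise ValueError("accuracies must satisfy 0 < p_b < p_a < 1")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    z_beta = std_normal.inv_cdf(power)

    # Standard two-proportion sample-size approximation, applied only to
    # the discriminative questions.
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    n_discriminative = (z_alpha + z_beta) ** 2 * variance / gap ** 2

    # Only a fraction rho of all questions is discriminative, so the full
    # test set must be larger by a factor of 1 / rho.
    return math.ceil(n_discriminative / rho)


if __name__ == "__main__":
    for rho in (1.0, 0.5, 0.25):
        print(rho, required_test_size(mean_acc=0.5, gap=0.05, rho=rho))
```

Under these assumptions the required size grows as $1/\rho$, shrinks as the gap widens, and is largest when the average accuracy is near 50%, which matches the qualitative claims in the caption even though the exact counts depend on the test and simulation details the authors actually used.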