Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

TL;DR

The paper tackles the gap in benchmarks for long-context QA by proposing an automated end-to-end pipeline that generates and evaluates book-based questions using entire books as context. It leverages entity-aware question generation and a pairwise side-by-side evaluation with a Bradley-Terry ranking, aided by AutoAIS for factual grounding. Validations on Les Misérables, The Wild Huntress, and NarrativeQA demonstrate that full-book context enables superior performance and that auto-evaluators can align with human judgments while revealing ground-truth quality issues. The work also analyzes auto-rater biases and cross-model agreement, highlighting the potential and limitations of automated long-context evaluation for advancing LLM reasoning over extended texts.

Abstract

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.
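
To make the pairwise ranking idea concrete, the following is a minimal sketch of fitting a Bradley-Terry model to side-by-side outcomes. It is an illustration under stated assumptions, not the paper's implementation: the fitting routine is the standard minorization-maximization (MM) update, the names `fit_bradley_terry` and `win_probability` are hypothetical, and the system labels and win counts are made-up toy values rather than results from the paper. Under the model, each system i has a positive strength p_i, and the probability that i is preferred over j is p_i / (p_i + p_j).

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is the number of wins of i and n_ij is the number of
    comparisons between i and j. Strengths are normalized to sum to 1.
    """
    systems = sorted({s for pair in comparisons for s in pair})
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    strengths = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iters):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            # A system that was never compared keeps its current strength.
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strengths[s]) for s in systems)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Model probability that system a is preferred over system b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy data (assumed for illustration only, not taken from the paper):
# each tuple is one side-by-side judgement, (preferred system, other system).
toy_comparisons = (
    [("full_context", "retrieval")] * 7 + [("retrieval", "full_context")] * 3
    + [("full_context", "no_context")] * 9 + [("no_context", "full_context")] * 1
    + [("retrieval", "no_context")] * 6 + [("no_context", "retrieval")] * 4
)

strengths = fit_bradley_terry(toy_comparisons)
for name in sorted(strengths, key=strengths.get, reverse=True):
    print(f"{name}: {strengths[name]:.3f}")
```

Because the fitted strengths come from relative judgements only, they are meaningful up to a common scale; the sketch fixes that scale by normalizing the strengths to sum to 1.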

Paper Structure

This paper contains 45 sections, 2 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: Overview of our framework. We use LLM-as-a-curator to generate a high-quality dataset, and then LLM-as-an-evaluator to rank the performances of a range of models on this dataset. The whole process incurs very little manual labor from humans, and instead leverages the creation and judgement power of LLMs.
  • Figure 2: Les Misérables QA-quality ranking.
  • Figure 3: The Wild Huntress QA-quality ranking.
  • Figure 4: Auto-Rater bias analysis. In all matrices System A$=$Gemini 1.5 Pro, and B$=$GPT-4 Turbo.
  • Figure 5: Panel (a) shows the percentage of cases in which the semantic similarity rater and the AutoAIS$_{T5}$ rater agree with AutoAIS$_{GPT-4}$. Panel (b) shows the percentage of cases in which the responses of two models (No context GPT-4 Turbo and Full context Gemini 1.5 Pro) are correct as rated by the three raters.
  • ...and 15 more figures