One Thousand and One Pairs: A "novel" challenge for long-context language models

Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

TL;DR

NoCha is a dataset of 1,001 narrative minimal pairs of true and false claims about 67 recently published English fiction titles, built to stress-test long-context reasoning through claim verification. Unlike surface-retrieval benchmarks, NoCha requires global synthesis across book-length narratives, and it reveals a substantial gap between human readers and current long-context models (GPT-4o achieves the highest accuracy at 55.8%; open-weight models perform near random chance). The study analyzes evidence scope, world-building complexity, and model explanations, showing that model justifications are often flawed and that retrieval-augmented approaches offer limited gains. With its scalable data-collection and evaluation methodology, NoCha provides a framework for evolving the dataset and evaluating future long-context systems in a more realistic setting.

Abstract

Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently published English fiction books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha requires global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval than on those that require global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and easy analysis of future models.


Paper Structure

This paper contains 60 sections, 14 figures, and 42 tables.

Figures (14)

  • Figure 1: Overview of NoCha's data collection and evaluation pipeline. Readers create true/false claim pairs for books published between 2023 and 2024, along with written justifications for the false labels. Each model is given the full book as context and evaluates one claim at a time. We measure pair accuracy: the model must label the true claim as true and the false claim as false to receive credit (see the code sketch after this figure list). This approach guards against label bias while better assessing true understanding of the text, since both claims pertain to the same events or parts of the story. Our books range from 49k to 336k tokens, and each model is tested on the subset of books that fits within its context window.
  • Figure 2: Examples of claim pairs where the models failed to verify one of the claims in the pair. Using narrative minimal pairs avoids crediting the model in cases where it merely appears to produce the correct answer while in fact making the prediction without fully or effectively using the context. In the top example, the model first correctly identifies hints dropped by the author about how the first victim dies (when verifying the true claim), then incorrectly claims that no such hints exist (when verifying the false claim). In the bottom example, the model first incorrectly claims that the key given to Nuna was not found in Rona's wooden chest (when verifying the true claim), then correctly raises no objection to the statement that the key was found in Rona's wooden chest (when verifying the false claim).
  • Figure 3: Performance of closed-source models on different types of novels. Two novels were excluded from this analysis because they could not be clearly classified into one of these categories. We provide the total number of claims in each category for reference; however, these numbers vary slightly across models due to context-length limitations and refusals by Gemini Pro 1.5 and Gemini Flash 1.5.
  • Figure 4: Model accuracy on claim pairs for individual stories within collections. Accuracy is shown for (1) using the entire collection as context when prompting about a story, and (2) using only the individual story ("story") as context for the same set of claims. For GPT-4o and GPT-4-Turbo, one book was too long, so only "story"-context performance is presented. Gemini Pro 1.5 and Gemini Flash 1.5 refused to process two books but handled the stories within them, so only "story"-context performance is available for those. For comparison, we also report the performance of Mixtral-8x22B (65k) and Qwen-2-72B (32k) on story-level input.
  • Figure 5: Genre distribution in NoCha. As a book can belong to multiple genres, such as fantasy and romance, we allow up to three labels per book.
  • ...and 9 more figures
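
For concreteness, the evaluation protocol from Figure 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' released code: the `ask_model` interface, the prompt wording, and the function names are assumptions made for illustration.

```python
from typing import Callable

# Stand-in for any long-context LLM call: takes a prompt string and
# returns the model's text response. (Hypothetical interface, not an
# actual API from the paper.)
AskModel = Callable[[str], str]

def verify_claim(ask_model: AskModel, book_text: str, claim: str) -> bool:
    """Verify a single claim against the full book text.

    The prompt wording here is an illustrative simplification, not the
    paper's exact template.
    """
    prompt = (
        f"{book_text}\n\n"
        "Based on the book above, is the following claim True or False? "
        "Answer with a single word.\n\n"
        f"Claim: {claim}"
    )
    return ask_model(prompt).strip().lower().startswith("true")

def pair_accuracy(
    ask_model: AskModel,
    book_text: str,
    pairs: list[tuple[str, str]],  # each pair: (true_claim, false_claim)
) -> float:
    """Fraction of pairs where BOTH claims are labeled correctly.

    Each claim is verified in a separate call, so the model never sees
    the two claims of a pair side by side.
    """
    correct = 0
    for true_claim, false_claim in pairs:
        true_ok = verify_claim(ask_model, book_text, true_claim)
        false_ok = not verify_claim(ask_model, book_text, false_claim)
        correct += true_ok and false_ok
    return correct / len(pairs) if pairs else 0.0
```

Scoring at the pair level is what guards against label bias: a degenerate model that always answers "true" labels every true claim correctly but every false claim incorrectly, earning 0% pair accuracy, while guessing each claim independently at random gets both claims of a pair right only about 25% of the time (0.5 × 0.5).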