Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley

TL;DR

Oolong introduces a rigorous benchmark for long-context reasoning and information aggregation, comprising two task sets: Oolong-synth (synthetic, controllable aggregation tasks) and Oolong-real (real-world, long-context transcripts from Critical Role). The tasks require identifying relevant context, performing subtask classifications, and aggregating results across large input windows, with careful data curation, context-window construction, and evaluation methodology. Across both splits, frontier models struggle to achieve high accuracy as context length grows, with temporal reasoning and aggregation bottlenecks identified as central challenges. The authors release data and an evaluation harness to spur progress toward robust long-context aggregation capabilities, highlighting substantial room for improvement in current models.

Abstract

As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
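
As a rough illustration of the task shape described above (label each chunk on its own, then aggregate the labels to answer a distributional question), here is a minimal sketch. It is not the Oolong evaluation harness: the function names, the toy keyword classifier, and the example chunks are all hypothetical and stand in for whatever per-chunk analysis a real task requires.

```python
from collections import Counter
from typing import Callable, Iterable


def aggregate_labels(chunks: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Label every chunk independently, then count how often each label occurs."""
    return Counter(classify(chunk) for chunk in chunks)


def most_common_label(counts: Counter) -> str:
    """Answer a distributional question: which label appears most often?"""
    return counts.most_common(1)[0][0]


# Hypothetical stand-in classifier (assumption): a real task would need a
# genuine per-chunk judgment, not a keyword check.
def toy_sentiment(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


# Hypothetical example chunks, not drawn from the benchmark.
chunks = ["Good movie", "Terrible plot", "Good acting", "Boring", "Dull"]
counts = aggregate_labels(chunks, toy_sentiment)
print(counts["positive"], counts["negative"], most_common_label(counts))
```

The point of the sketch is that every chunk contributes to the answer, so none of the context can be discarded as noise; a model answering in-context has to perform both steps implicitly over the whole input.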

Paper Structure

This paper contains 39 sections, 1 equation, 7 figures, 19 tables.

Figures (7)

  • Figure 1: Oolong poses questions that require performing a multi-step information aggregation process to determine the solution. Oolong-synth uses ICL-based tasks, which could be easily decomposed and solved iteratively, as a proxy for real-world aggregation tasks over long inputs. Oolong-real poses challenging information aggregation questions over transcripts from live-action Dungeons & Dragons shows, which cannot be easily decomposed into component pieces.
  • Figure 2: Scores by context window length for Oolong-synth and Oolong-real.
  • Figure 3: Comparison across reasoning levels.
  • Figure 4: The performance trend for models by type of answer and type of task on Oolong-synth.
  • Figure 5: Comparison on Oolong-synth. (a) Providing the gold labels in the input leads to a consistent but small improvement. (b) Short-context performance: while the top models have similar short-context performance, differences emerge as the context length grows.
  • ...and 2 more figures