DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models

Utkarsh Tiwari, Aryan Seth, Adi Mukherjee, Kaavya Mer, Kavish, Dhruv Kumar

TL;DR

DebateBench addresses a gap in long-context natural language reasoning by providing a high-quality, argument-centric benchmark built from British Parliamentary debates. The dataset comprises transcripts and metadata for 32 debates with 256 speeches (~36 hours total, ~32k tokens per input), together with an adjudication-grounded evaluation framework covering three tasks: Verdict Prediction, Speaker Scores, and Speaker Ranks, all aligned to human ground truth. The authors test three LLMs (o1, GPT-4o, Claude 3.5 Haiku) at zero temperature in an in-context-learning regime with the official WUDC judging manual as context, and find that current models struggle to perform accurate, structured reasoning over such long contexts. Key findings show that while some models do comparatively well at ranking, all exhibit sizable errors on verdict and speaker-specific tasks, underscoring the need for improved long-context reasoning techniques, better alignment with human judgments, and potential extensions to bias analysis and richer argument annotations. DebateBench thus offers a rigorous platform for evaluating in-context learning, argumentation quality, and human alignment of large language models on real-world, lengthy discourse.

Abstract

We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates; each debate lasts over an hour, and each input averages 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation, deliberation, and alignment with human experts. To do well on DebateBench, an LLM must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze eight seven-minute speeches and reason about the arguments presented by all speakers to produce the final results. Our preliminary evaluation using o1, GPT-4o, and Claude Haiku shows that LLMs struggle to perform well on DebateBench, highlighting the need for more sophisticated techniques to improve their performance.

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Performance of models on DebateBench. The y-axis shows the mean absolute error (MAE) on each of the three tasks. More details are given in the evaluation section.
  • Figure 2: The model receives (a) the system prompt explaining the format of the debate and the metrics of judgment, (b) the information slide (if present) and the motion, and (c) the transcript of the debate, which contains 8 speeches by 4 teams (or houses). The model is tested on 3 tasks (d), and its output is compared to the results given by trained judges to compute the task scores for the model (e).
  • Figure 3: Model accuracy for speaker score prediction at varying delta windows from ground truth
  • Figure 4: The judging manual is adapted from the WUDC judging manual and contains 15,361 words. The entire prompt, including the judging manual, can be found in the code repository.
  • Figure 5: List of debate rounds included in the dataset.
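
The captions for Figures 1 and 3 refer to two metrics: mean absolute error (MAE) against the judges' results, and accuracy within a delta window of the ground-truth speaker score. The following is a minimal Python sketch of how such metrics could be computed. The function names and the example scores are hypothetical illustrations, not taken from the paper or its code repository, and the paper's exact metric definitions may differ.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between model outputs and judge results."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def accuracy_within_delta(
    predicted: Sequence[float], actual: Sequence[float], delta: float
) -> float:
    """Fraction of predictions that fall within +/- delta of the ground truth."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= delta)
    return hits / len(predicted)


# Hypothetical speaker scores for the eight speeches of one debate
# (illustrative values only, not drawn from the dataset).
judge_scores = [78, 76, 80, 77, 75, 79, 74, 81]
model_scores = [77, 79, 80, 74, 75, 82, 76, 80]

mae = mean_absolute_error(model_scores, judge_scores)
acc_by_delta = {d: accuracy_within_delta(model_scores, judge_scores, d) for d in (1, 2, 3)}

print(f"MAE: {mae}")
for d, acc in acc_by_delta.items():
    print(f"accuracy within +/-{d}: {acc}")
```

Widening the delta window can only keep the accuracy the same or raise it, so a curve like the one described for Figure 3 shows how close a model's scores typically are to the judges' scores, rather than only how often they match exactly.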