DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu

TL;DR

DetectiveQA introduces a long-context, narrative reasoning benchmark derived from detective novels, featuring English and Chinese questions (1200 items) paired with explicit reference reasoning steps and a novel step-wise reasoning metric. The dataset, built from orthodox detective literature with contexts averaging over 100k tokens, enables evaluation of both answer accuracy and the coherence of reasoning chains, including evidence retrieval. Evaluations across GPT-4, Claude, and LLaMA reveal persistent challenges in long-context reasoning and evidence retrieval, with data-contamination analyses and long-vs-short-context comparisons informing the reliability of results. Overall, DetectiveQA provides a rigorous framework for measuring deep, step-wise narrative reasoning in LLMs and highlights gaps that motivate future improvements in long-context understanding and reasoning capabilities.

Abstract

Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1200 human-annotated questions in both Chinese and English, each paired with corresponding reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric, which enhances the evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent challenges in both long-context reasoning and evidence retrieval. Our findings offer valuable insights into the study of long-context reasoning and lay the groundwork for more rigorous evaluations.

Paper Structure (33 sections, 10 figures, 8 tables)

This paper contains 33 sections, 10 figures, 8 tables.

Figures (10)

  • Figure 1: An example of annotation in DetectiveQA. We highlight the explicit evidence of reasoning in blue and implicit evidence in green. The whole reference steps include both. In contrast, in the Evidence Position field, the part corresponding to the explicit evidence will be the paragraph index in the novel, while that corresponding to the implicit evidence will be -1.
  • Figure 2: Illustration of DetectiveQA. The center shows the main annotation process, where human annotators annotate reasoning problems based on various information. On the left, AI-assisted information extraction offers summaries to help annotators quickly understand novels and locate key information. The right side, the most critical part, involves evaluating models using DetectiveQA, where the reasoning metric and answer accuracy are measured.
  • Figure 3: The reasoning metric, with an illustration and its three settings.
  • Figure 4: The distribution of the context tokens of samples in DetectiveQA. The novel content for each question is truncated before the answer appears.
  • Figure 5: Multi-needle-in-a-haystack test results for different models. We treat each clue in the reference steps found in the article as a "needle" and determine whether the needle is detected by checking if it is included in the model's reasoning process. We define "depth" as how far from the beginning of the document the evidence appears, expressed as a percentage of the problem's total character count. Our analysis focuses on recall based on varying context lengths and clue depths.
  • ...and 5 more figures
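
The needle recall and depth described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Clue` class, its `char_offset` field, and both function names are hypothetical, and plain substring matching stands in for whatever check the paper actually uses to decide whether a clue is included in the model's reasoning.

```python
from dataclasses import dataclass


@dataclass
class Clue:
    text: str         # clue wording taken from the reference steps
    char_offset: int  # assumed: character position of the clue in the context


def clue_depth(clue: Clue, context_chars: int) -> float:
    """Depth of a clue as a percentage of the context's total character count."""
    return 100.0 * clue.char_offset / context_chars


def needle_recall(clues: list[Clue], model_reasoning: str) -> float:
    """Fraction of clues found in the model's reasoning.

    Assumption: case-insensitive substring matching is a simple stand-in
    for the inclusion check described in the caption.
    """
    if not clues:
        return 0.0
    reasoning = model_reasoning.lower()
    found = sum(1 for clue in clues if clue.text.lower() in reasoning)
    return found / len(clues)


# Usage with made-up data: two clues in a 10,000-character context.
clues = [Clue("muddy boots", 2500), Clue("locked window", 7500)]
reasoning = "The muddy boots show that the visitor came in from the garden."
depths = [clue_depth(c, 10_000) for c in clues]
recall = needle_recall(clues, reasoning)
```

Grouping `recall` by buckets of context length and `depths` would give the kind of grid the caption describes.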