DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu
TL;DR
DetectiveQA is a long-context narrative reasoning benchmark derived from detective novels. It contains 1,200 questions in English and Chinese, each paired with explicit reference reasoning steps, and introduces a step-wise reasoning metric. Built from orthodox detective literature with contexts averaging over 100k tokens, the dataset supports evaluation of both answer accuracy and the coherence of reasoning chains, including evidence retrieval. Evaluations of GPT-4, Claude, and LLaMA reveal persistent challenges in long-context reasoning and evidence retrieval; data-contamination analyses and long-vs-short-context comparisons are used to check the reliability of these results. Overall, DetectiveQA provides a rigorous framework for measuring deep, step-wise narrative reasoning in LLMs and highlights gaps that motivate future work on long-context understanding and reasoning.
Abstract
Recently, significant effort has been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We use detective novels, averaging over 100k tokens each, to create a dataset of 1,200 human-annotated questions in both Chinese and English, each paired with reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric that improves the evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent challenges in both long-context reasoning and evidence retrieval. Our findings offer valuable insights into the study of long-context reasoning and lay the groundwork for more rigorous evaluations.
