DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities
Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang
TL;DR
DocPuzzle is a process-aware benchmark for evaluating long-context reasoning in LLMs, built from 100 expert-annotated, multi-domain QA tasks grounded in long real-world documents. A human-AI collaborative annotation-validation pipeline ensures task quality, and a checklist-based evaluation decouples reasoning validity from final answers, which mitigates guessing bias and enables detailed analysis of intermediate reasoning. Experiments show that slow-thinking models outperform typical instruct models, while distilled reasoning models transfer poorly to complex long-context tasks, underscoring the limits of distillation for advanced reasoning. With its multi-domain scope and validation pipeline, the benchmark offers a framework for assessing reasoning processes and for guiding future research toward more generalizable reasoning in AI systems.
Abstract
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). The benchmark comprises 100 expert-level QA problems that require multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: (1) advanced slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models such as Claude 3.5 Sonnet (57.7%); (2) distilled reasoning models such as DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind the teacher model, suggesting that generalizable reasoning capabilities are difficult to maintain through distillation alone.
