Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Shiwei Ye, Xianpei Han, Ben He, Le Sun
TL;DR
The paper presents AOE, a bilingual benchmark that evaluates how well LLMs reconstruct organized tables from long, fragmented real-world documents in the Academic, Legal, and Financial domains. Its hierarchical evaluation framework, with offline and online settings and a ground-truth annotation process, measures structural parsability, overall quality, and cell-level content accuracy (Cell F1) under dynamic schema construction. Experimental results reveal a pervasive Area of Effect, in which models display an illusion of competence (well-formed structure but low factual accuracy) and agentic systems hit gridlock from retrieval and resource constraints. The work highlights the challenges of multi-document structured knowledge construction and motivates future research on reliable, end-to-end knowledge extraction that combines dynamic schema induction with precise content grounding. The benchmark, available under Apache 2.0, provides a rigorous testbed for trustworthy, table-based knowledge extraction from real-world documents.
Abstract
With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths, designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In our experiments, we evaluate both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggle significantly. The benchmark is available at https://anonymous.4open.science/r/AOE-Benchmark/.
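
To make the cell-level content accuracy (Cell F1) mentioned in the TL;DR more concrete, the following is a minimal sketch of how such a score could be computed. It is not the benchmark's actual scorer. It assumes, purely for illustration, that tables are lists of row dictionaries, that rows are aligned by a designated key column, and that cells match on normalized exact string equality. The names `normalize`, `table_cells`, and `cell_f1` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Cell = Tuple[str, str, str]  # (row key, column name, cell value)


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(value).lower().split())


def table_cells(rows: List[Row], key: str) -> Set[Cell]:
    """Flatten a table into a set of (row key, column, value) triples."""
    cells: Set[Cell] = set()
    for row in rows:
        row_id = normalize(row.get(key, ""))
        for column, value in row.items():
            if column == key:
                continue  # the key column only identifies the row
            cells.add((row_id, normalize(column), normalize(value)))
    return cells


def cell_f1(pred: List[Row], gold: List[Row], key: str) -> float:
    """Harmonic mean of cell-level precision and recall."""
    pred_cells = table_cells(pred, key)
    gold_cells = table_cells(gold, key)
    true_positives = len(pred_cells & gold_cells)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_cells)
    recall = true_positives / len(gold_cells)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the predicted table has the right shape,
# but one of its four cells is factually wrong.
gold = [
    {"company": "Acme", "revenue": "10", "year": "2023"},
    {"company": "Beta", "revenue": "20", "year": "2023"},
]
pred = [
    {"company": "acme", "revenue": "10", "year": "2022"},
    {"company": "Beta", "revenue": "20", "year": "2023"},
]

if __name__ == "__main__":
    print(cell_f1(pred, gold, key="company"))
```

In this toy case the predicted table is fully parsable and has the right schema, yet only three of its four cells agree with the gold table. This is the kind of gap between structure and content that a cell-level score is meant to expose.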
