Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction

Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Shiwei Ye, Xianpei Han, Ben He, Le Sun

TL;DR

The paper presents AOE, a bilingual benchmark that evaluates the ability of LLMs to reconstruct organized tables from long, fragmented real-world documents across Academic, Legal, and Financial domains. Using a hierarchical evaluation framework with offline and online settings and a ground-truth annotation process, AOE measures structural parsability, overall quality, and cell-level content accuracy (Cell F1) in a setting where models must construct the table schema dynamically. Experimental results reveal pervasive failures: models display an illusion of competence (well-formed structure but low factual accuracy), and agentic systems face gridlock from retrieval and resource constraints. The work highlights the challenges of multi-document structured knowledge construction and motivates future research toward reliable, end-to-end knowledge extraction that combines dynamic schema induction with precise content grounding. The benchmark, available under Apache 2.0, provides a rigorous testbed for advancing trustworthy, table-based knowledge extraction from real-world documents.

Abstract

With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In our experiments, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://anonymous.4open.science/r/AOE-Benchmark/.

Paper Structure

This paper contains 79 sections, 2 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: A comparison of AOE and previous tasks. Unlike (a) extracting "Isolated Dots" or (b) generating "Unverifiable Output," (c) our benchmark requires constructing "Organized Evidence" for verifiable analysis.
  • Figure 2: Construction process of our AOE Benchmark.
  • Figure 3: Overview of the Automated Evaluation Pipeline. The pipeline evaluates generated tables from three perspectives: (1) CSV Parsability for basic structural correctness; (2) Overall Quality via an LLM evaluator for a holistic score; and (3) Content Evaluation to calculate a Cell F1 score based on cell-level comparison after column and row alignment.
  • Figure 4: Agentic models show significantly low Row F1 in retrieval-intensive scenarios (e.g., Financial), whereas the performance gap narrows in domains amenable to precise search (e.g., Academic).
  • Figure 5: Distribution of termination reasons in Tongyi-DeepResearch, categorized by successful (Pass Rate=1) and failed (Pass Rate=0) outcomes. Resource exhaustion, combining API call and token limits, is the dominant failure mode, accounting for 73.3% (173 out of 236) of all failures.
  • ...and 2 more figures
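
The Figure 3 caption names two checks that can be illustrated in a few lines: CSV parsability and a cell-level F1 computed after rows and columns are aligned. The following is a minimal sketch under stated assumptions, not the benchmark's actual evaluation code. It assumes rows are aligned by a shared key column, columns are aligned by header name, and cells match by exact string comparison after whitespace and case normalization. The function names (`parse_csv`, `cell_f1`) and the toy tables are hypothetical.

```python
import csv
import io


def parse_csv(text):
    """Return rows as a list of dicts, or None if the text is not a usable CSV table."""
    try:
        reader = csv.DictReader(io.StringIO(text.strip()), strict=True)
        rows = list(reader)
    except csv.Error:
        return None
    if not reader.fieldnames or not rows:
        return None
    # DictReader stores surplus fields under the key None and pads short rows with None.
    for row in rows:
        if None in row or any(value is None for value in row.values()):
            return None
    return rows


def normalize(value):
    """Collapse whitespace and lowercase (assumed matching rule, not from the paper)."""
    return " ".join(value.split()).lower()


def cells(rows, key_column):
    """Flatten a table into a set of (row key, column name, value) triples."""
    out = set()
    for row in rows:
        key = normalize(row.get(key_column, ""))
        for column, value in row.items():
            if column != key_column:
                out.add((key, normalize(column), normalize(value)))
    return out


def cell_f1(predicted, gold, key_column):
    """Harmonic mean of cell-level precision and recall over aligned triples."""
    pred_cells = cells(predicted, key_column)
    gold_cells = cells(gold, key_column)
    matched = len(pred_cells & gold_cells)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_cells)
    recall = matched / len(gold_cells)
    return 2 * precision * recall / (precision + recall)


# Toy tables (invented for illustration): one of four cells is wrong.
gold_table = parse_csv("model,params,score\nA,7B,0.61\nB,13B,0.70\n")
pred_table = parse_csv("model,params,score\nA,7B,0.61\nB,13B,0.65\n")
score = cell_f1(pred_table, gold_table, key_column="model")
```

In the toy example the predicted table parses cleanly and has the right shape, yet one of its four cells is wrong, so precision and recall are both 3/4. This is the pattern the TL;DR calls an illusion of competence: structure can be correct while content is not.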