Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

Yeonjee Han

Abstract

Structured documents (tables paired with captions, figures with explanations, equations with the paragraphs that interpret them) are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We make four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness through two invariants; and (4) cross-parser validation showing that EU spatial footprints converge across MinerU and Docling, with gains preserved under parser-induced bbox variance. Experiments on OmniDocBench v1.0 (1,340 pages; 1,551 QA pairs) show that EU-based chunking improves retrieval LCS by 0.31 (from 0.50 to 0.81). Recall@1 increases from 0.15 to 0.51 (3.4x), and MinK decreases from 2.58 to 1.72. Cross-parser results confirm that the gain (LCS +0.23 to +0.31) is preserved across parsers. Text queries show the largest gain: Recall@1 rises from 0.08 to 0.47.

Paper Structure

This paper contains 49 sections, 6 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: EU Construction Pipeline. Stage 1: ontology-grounded node normalization via cascaded role assignment (pattern matching $\to$ TYPE_MAP $\to$ embedding fallback). Stage 2: three-phase EU construction (Phase A: visual-core formation; Phase B: global semantic allocation; Phase C: residual consolidation). EU output feeds downstream applications: RAG retrieval, LLM context assembly, and knowledge graph ingestion.
  • Figure 2: EU spatial footprint convergence across parsers. Each column shows how a different parser decomposes the same table region: Parser A emits one bbox; Docling emits three row-level bboxes that merge in Phase A; PaddleOCR-VL emits a single VLM-inferred bbox with $\pm$0.02 positional error (shown in red). Despite these differences, the EU footprint (yellow dashed outline), defined as the bounding box of all EU members including section header, caption, unit label, and adjacent paragraphs, converges to the same page region across all parsers. The Docling case achieves IoU = 0.88 due to an attachment range effect on the trailing paragraph; all other parsers achieve IoU = 1.00.
  • Figure 3: Cross-parser evaluation on OmniDocBench (1,551 QA pairs, Strict protocol). Left: Recall@K curves for GT, MinerU, and Docling tracks, with and without EU. Right: LCS by evidence source (table, figure, text) across all three parsers. The EU improvement ($\Delta$LCS $\approx$ +0.23 to +0.31) is consistent across all parsers, and both MinerU w/ EU and Docling w/ EU match or exceed GT w/o EU on all sources.
  • Figure 4: Case study on an actual OmniDocBench page (ceat.200600266omnidocbench). (1) Parser output: six individually detected layout elements, each stored as a separate retrieval chunk. (2) EU grouping: the same elements organised into three EUs; EU-B (table_panel) co-locates the table caption, table body, and surrounding context into one unit. (3) Retrieval: w/o EU retrieves the caption at rank 1 (sim=1.00) but the table body is a lower-ranked separate chunk, making the answer unreachable; w/ EU retrieves EU-B at rank 1, returning the complete evidence in a single retrieval.
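
The footprint comparison described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes bboxes are `(x0, y0, x1, y1)` tuples in normalized page coordinates; the function names (`footprint`, `iou`) and all coordinates are hypothetical and chosen for illustration only, not taken from the paper's implementation.

```python
from typing import Iterable, Tuple

# Assumed convention: (x0, y0, x1, y1) in normalized page coordinates.
BBox = Tuple[float, float, float, float]


def footprint(members: Iterable[BBox]) -> BBox:
    """Bounding box enclosing every member bbox of one EU."""
    boxes = list(members)
    if not boxes:
        raise ValueError("an EU needs at least one member bbox")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def area(b: BBox) -> float:
    """Area of a bbox; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned bboxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical page region: a caption, a table, and a trailing paragraph.
caption = (0.1, 0.25, 0.9, 0.30)
paragraph = (0.1, 0.60, 0.9, 0.70)

# One parser emits the table as a single bbox ...
single_table = [(0.1, 0.30, 0.9, 0.60)]
# ... another emits three row-level bboxes for the same table.
row_level = [
    (0.1, 0.30, 0.9, 0.40),
    (0.1, 0.40, 0.9, 0.50),
    (0.1, 0.50, 0.9, 0.60),
]

fp_single = footprint([caption, *single_table, paragraph])
fp_rows = footprint([caption, *row_level, paragraph])
# Same EU, but the trailing paragraph was not attached.
fp_missing = footprint([caption, *row_level])

iou_same = iou(fp_single, fp_rows)        # footprints coincide
iou_partial = iou(fp_single, fp_missing)  # footprint shrinks, IoU drops
```

Because the footprint is taken over all EU members, how a parser splits the table internally does not change it; only a difference in which elements are attached (here, the trailing paragraph) lowers the IoU.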