TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
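
To make the idea of preserving hierarchical lineage concrete, the following is a minimal sketch, not the paper's implementation. It assumes the SIR can be approximated by a plain tree of headed sections; the names `SIRNode` and `chunks_with_lineage` are hypothetical. Each chunk is emitted together with the chain of ancestor headings that linearized chunking would discard.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SIRNode:
    """Hypothetical stand-in for one node of a structured representation."""
    title: str
    text: str = ""
    children: List["SIRNode"] = field(default_factory=list)


def chunks_with_lineage(
    node: SIRNode, ancestors: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, str]]:
    """Yield (lineage, text) pairs in document order.

    The lineage is the path of headings from the root to the node,
    so every chunk carries its position in the hierarchy.
    """
    path = ancestors + (node.title,)
    if node.text:
        yield " > ".join(path), node.text
    for child in node.children:
        yield from chunks_with_lineage(child, path)


# Toy document (invented for illustration only).
doc = SIRNode("Report", children=[
    SIRNode("1 Background", "The agency was created in 1990."),
    SIRNode("2 Findings", children=[
        SIRNode("2.1 Costs", "It exceeded its budget."),
    ]),
])

chunks = list(chunks_with_lineage(doc))
for lineage, text in chunks:
    print(f"[{lineage}] {text}")
```

In this toy example, the chunk "It exceeded its budget." is ambiguous on its own; the attached lineage "Report > 2 Findings > 2.1 Costs" restores part of the context that a flat splitter would lose.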

Paper Structure (29 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: The overall architecture of TopoChunker. The dual-agent framework operates across three layers: Cognitive Perception (Inspector Agent for adaptive routing), Execution (Topological Pruner and SIR construction), and Audit (Refiner Agent for context disambiguation). These layers interact via a shared Structured Intermediate Representation (SIR) to form a closed-loop Diagnosis-Execution-Audit pipeline.
  • Figure 2: Generation Accuracy. A heatmap illustrating the generation accuracy of various chunking methods across the GutenQA and GovReport datasets. Darker shades indicate higher accuracy.
  • Figure 3: Complexity-Aware Token Consumption Analysis. TopoChunker dynamically adjusts token expenditure based on document complexity, achieving lower average costs than static agentic baselines.
  • Figure 3: System Prompts for the Inspector and Refiner Agents.
  • Figure 4: Qualitative Examples of TopoChunker output resolving contextual fragmentation. (a) On the GutenQA dataset, the lack of heading hierarchy triggers Path 2: Semantic Flow. TopoChunker actively resolves semantic islands by explicitly mapping dangling pronouns to their ancestral entities in the Context Supplement. (b) On the GovReport dataset, explicit hierarchies trigger Path 1: Structural Rule. Generic references are successfully resolved to specific entities, ensuring chunks remain semantically self-contained for downstream RAG.
  • ...and 3 more figures
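
The routing behavior described in the Figure 4 caption (explicit hierarchies trigger Path 1: Structural Rule, a missing heading hierarchy triggers Path 2: Semantic Flow) can be illustrated with a rough sketch. In the paper the routing is performed by the Inspector Agent; the regex heuristic below is only a stand-in assumed for illustration, and `HEADING`, `choose_path`, and `min_headings` are hypothetical names.

```python
import re

# Assumed heading patterns: Markdown-style "# Title" or numbered "2.1 Costs".
HEADING = re.compile(r"^\s*(#{1,6}\s+\S|\d+(\.\d+)*\.?\s+[A-Z])")


def choose_path(text: str, min_headings: int = 2) -> str:
    """Pick an extraction path from a crude count of heading-like lines.

    This is a heuristic placeholder for an agent's routing decision,
    not the decision procedure used by TopoChunker.
    """
    headings = sum(1 for line in text.splitlines() if HEADING.match(line))
    return "structural_rule" if headings >= min_headings else "semantic_flow"


report = "1 Background\nText here.\n2.1 Costs\nMore text."
novel = "It was a dark night.\nHe walked home."

print(choose_path(report))
print(choose_path(novel))
```

A fixed threshold like `min_headings` is the simplest possible choice; an agentic router could weigh richer signals, which is where the cost trade-off mentioned in the abstract would arise.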