TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves 83.26% Recall@3 while reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

Paper Structure (29 sections, 8 figures, 3 tables)

Introduction
Related Work
Granularity Evolution: From Heuristic Rules to Semantic Sensitivity
Hierarchical Representation and Context Augmentation in RAG
Agentic Workflows and Adaptive Diagnostics for Document Intelligence
Method
Cognitive Perception and Adaptive Routing
Active Probing and Path Selection
Topological Modeling and SIR Construction
Semantic Refinement and Context Assembly
Capacity Auditing and Semantic Slicing
Semantic Signature Generation
Topological Context Disambiguation
Experiments
Experimental Setup
...and 14 more sections

Figures (8)

Figure 1: The overall architecture of TopoChunker. The dual-agent framework operates across three layers: Cognitive Perception (Inspector Agent for adaptive routing), Execution (Topological Pruner and SIR construction), and Audit (Refiner Agent for context disambiguation). These layers interact via a shared Structured Intermediate Representation (SIR) to form a closed-loop Diagnosis-Execution-Audit pipeline.
Figure 2: Generation Accuracy. A heatmap illustrating the generation accuracy of various chunking methods across the GutenQA and GovReport datasets. Darker shades indicate higher accuracy.
Figure 3: Complexity-Aware Token Consumption Analysis. TopoChunker dynamically adjusts token expenditure based on document complexity, achieving lower average costs than static agentic baselines.
Figure 3: System Prompts for the Inspector and Refiner Agents.
Figure 4: Qualitative Examples of TopoChunker output resolving contextual fragmentation. (a) On the GutenQA dataset, the lack of heading hierarchy triggers Path 2: Semantic Flow. TopoChunker actively resolves semantic islands by explicitly mapping dangling pronouns to their ancestral entities in the Context Supplement. (b) On the GovReport dataset, explicit hierarchies trigger Path 1: Structural Rule. Generic references are successfully resolved to specific entities, ensuring chunks remain semantically self-contained for downstream RAG.
...and 3 more figures
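
The routing behavior described in the Figure 4 caption (explicit hierarchies trigger Path 1: Structural Rule, while a lack of heading hierarchy triggers Path 2: Semantic Flow) can be illustrated with a minimal sketch. This is not the paper's Inspector Agent, which is LLM-based; the regex heading detector, the `route` function, and the `min_headings` threshold are all hypothetical stand-ins chosen only to show the shape of the decision.

```python
import re
from enum import Enum


class Path(Enum):
    # Path names are taken from the Figure 4 caption.
    STRUCTURAL_RULE = "Path 1: Structural Rule"
    SEMANTIC_FLOW = "Path 2: Semantic Flow"


# Assumption: headings look like Markdown ("# Title") or numbered
# section titles ("2.1 Title"). A real system would probe more signals.
HEADING_RE = re.compile(r"^(#{1,6}\s+\S|\d+(\.\d+)*\.?\s+[A-Z])")


def route(document: str, min_headings: int = 2) -> Path:
    """Pick an extraction path from a cheap structural probe (hypothetical)."""
    headings = [
        line for line in document.splitlines()
        if HEADING_RE.match(line.strip())
    ]
    # Enough explicit headings: rely on structural rules.
    if len(headings) >= min_headings:
        return Path.STRUCTURAL_RULE
    # Otherwise fall back to semantic-flow segmentation.
    return Path.SEMANTIC_FLOW


report = "# Background\nText.\n## Findings\nMore text."
novel = "It was a dark night. He walked on.\nShe followed him."

print(route(report).value)
print(route(novel).value)
```

In this sketch a report-like input with headings is sent down the structural path, and a heading-free narrative falls through to the semantic path, mirroring the GovReport and GutenQA cases in the caption.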
