PT-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Papers

Rui Yu, Tianyi Wang, Ruixia Liu, Yinglong Wang

TL;DR

PT-RAG is a RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior, giving subsequent retrieval a low-entropy structural basis. It also introduces entropy-based structural diagnostics that quantify retrieval fragmentation and evidence allocation accuracy.

Abstract

Retrieval-augmented generation (RAG) is increasingly applied to question answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. Existing approaches typically flatten academic papers into unstructured chunks during preprocessing, which destroys the native hierarchical structure. This loss forces retrieval to operate in a disordered space, producing fragmented contexts, misallocating tokens to non-evidential regions under finite token budgets, and increasing the reasoning burden on downstream language models. To address these issues, we propose PT-RAG, a RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior. PT-RAG first inherits the native hierarchy to construct a structure-fidelity PaperTree index, which prevents entropy increase at the source. It then applies a path-guided retrieval mechanism that aligns query semantics to relevant sections and selects high-relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, low-entropy retrieval contexts. In contrast to existing RAG approaches, PT-RAG avoids the entropy increase caused by destructive preprocessing and gives subsequent retrieval a native low-entropy structural basis. To assess this design, we introduce entropy-based structural diagnostics that quantify retrieval fragmentation and evidence allocation accuracy. On three academic question-answering benchmarks, PT-RAG achieves consistently lower section entropy and evidence-alignment cross-entropy than strong baselines, indicating reduced context fragmentation and more precise allocation to evidential regions. These structural advantages directly translate into higher answer quality.
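
The abstract names two diagnostics, section entropy and evidence-alignment cross-entropy, without giving their formulas. The sketch below is one plausible reading, not the paper's definition. It assumes section entropy is the Shannon entropy of the retrieved token mass per section, and that the cross-entropy compares a uniform distribution over gold evidence sections against that same token distribution. The function names, the `(section_id, token_count)` input format, and the `eps` smoothing term are all hypothetical.

```python
import math
from collections import Counter


def token_distribution(retrieved):
    """Share of retrieved tokens per section.

    `retrieved` is a list of (section_id, token_count) pairs, one per chunk.
    This input format is an assumption made for illustration.
    """
    mass = Counter()
    for section, tokens in retrieved:
        mass[section] += tokens
    total = sum(mass.values())
    return {s: m / total for s, m in mass.items()} if total else {}


def section_entropy(retrieved):
    """Shannon entropy (bits) of the token mass across sections.

    Assumed reading: a context drawn from many sections scores high,
    a context concentrated in one section scores 0.
    """
    q = token_distribution(retrieved)
    return -sum(p * math.log2(p) for p in q.values() if p > 0)


def evidence_cross_entropy(retrieved, evidence_sections, eps=1e-9):
    """Cross-entropy (bits) between gold evidence sections and retrieved tokens.

    Assumed reading: the target is uniform over `evidence_sections`;
    `eps` is a hypothetical smoothing term that avoids log(0).
    """
    if not evidence_sections:
        raise ValueError("at least one evidence section is required")
    q = token_distribution(retrieved)
    p = 1.0 / len(evidence_sections)
    return -sum(p * math.log2(q.get(s, 0.0) + eps) for s in evidence_sections)


# Two contexts with the same 500-token budget.
focused = [("3.2", 300), ("3.2", 200)]
scattered = [("1", 125), ("2", 125), ("3.2", 125), ("5", 125)]
```

Under these assumptions, the focused context has lower section entropy than the scattered one and, when the gold evidence lies in section 3.2, lower cross-entropy as well. That matches the direction the abstract reports for PT-RAG.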

Paper Structure (30 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Structure-Fidelity & RAG. Current RAG methods flatten documents into unordered chunks, losing section signals and leading to fragmented, inaccurate retrieval. In contrast, humans naturally navigate papers via section hierarchies to localize evidence efficiently. Inspired by this, PT-RAG constructs a PaperTree index that preserves the native outline and performs path-guided retrieval, enabling precise, low-entropy context assembly under token budgets.
  • Figure 2: Overview of the PT-RAG Framework. Our framework builds a PaperTree index via structure-anchored segmentation and summarization, enabling path-guided retrieval of coherent, low-entropy contexts for accurate answer generation. (a) PT-RAG Pipeline: illustrates the complete process from document parsing to answer generation. (b) PaperTree Index: shows hierarchical organization of papers with contextualized summaries. (c) Path-Guided Retrieval: highlights query-adaptive path selection based on semantic relevance.
  • Figure 3: Ablation study under matched budgets. Removing the contextual summary or the path-guided retrieval module reduces answer quality.
  • Figure 4: Total efficiency and cost analysis under matched budgets. PT-RAG achieves lower latency and cost relative to advanced baselines.
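
The Figure 2 caption describes a hierarchical PaperTree index and query-adaptive selection of root-to-leaf paths under a token budget. The sketch below illustrates that idea in miniature and is not the authors' implementation. The `Node` class, the word-overlap `relevance` score (a stand-in for whatever semantic scoring the paper uses), the mean-relevance path score, and the greedy budget rule are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a paper; children are its subsections (hypothetical schema)."""
    title: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

    @property
    def tokens(self) -> int:
        # Whitespace word count as a crude token estimate (assumption).
        return len(self.text.split())


def root_to_leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from root_to_leaf_paths(child, path)


def relevance(query, node):
    """Toy lexical overlap in [0, 1]; a real system would likely use embeddings."""
    q = set(query.lower().split())
    words = set((node.title + " " + node.text).lower().split())
    return len(q & words) / len(q) if q else 0.0


def path_score(query, path):
    """Mean relevance of the nodes on a path (assumed scoring rule)."""
    return sum(relevance(query, n) for n in path) / len(path)


def select_paths(root, query, budget):
    """Greedily keep the best-scoring paths whose new nodes still fit the budget."""
    ranked = sorted(root_to_leaf_paths(root),
                    key=lambda p: path_score(query, p), reverse=True)
    chosen, seen, used = [], set(), 0
    for path in ranked:
        # Shared ancestors are charged once, so count only unseen nodes.
        new_nodes = [n for n in path if id(n) not in seen]
        cost = sum(n.tokens for n in new_nodes)
        if used + cost <= budget:
            chosen.append(path)
            seen.update(id(n) for n in new_nodes)
            used += cost
    return chosen, used


root = Node("Paper", children=[
    Node("Method", "overview of the approach", [
        Node("Retrieval", "path guided retrieval selects sections under a token budget"),
    ]),
    Node("Experiments", "benchmarks and baselines", [
        Node("Ablation", "removing the summary module reduces answer quality"),
    ]),
])
query = "how does retrieval use the token budget"
```

Calling `select_paths(root, query, budget=15)` keeps only the Paper, Method, Retrieval path, because the second path's new nodes no longer fit. With a larger budget both paths are returned. Charging shared ancestors once keeps the assembled context contiguous within a section rather than scattered across the document.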