Graph-Aware Late Chunking for Retrieval-Augmented Generation in Biomedical Literature

Pouria Mortezaagha, Arya Rahgozar

Abstract

Retrieval-Augmented Generation (RAG) systems for biomedical literature are typically evaluated using ranking metrics like Mean Reciprocal Rank (MRR), which measure how well the system identifies the single most relevant chunk. We argue that for full-text scientific documents, this paradigm is incomplete: it rewards retrieval precision while ignoring retrieval breadth -- the ability to surface evidence from across a document's structural sections. We propose GraLC-RAG, a framework that unifies late chunking with graph-aware structural intelligence, introducing structure-aware chunk boundary detection, UMLS knowledge graph infusion, and graph-guided hybrid retrieval. We evaluate six strategies on 2,359 IMRaD-filtered PubMed Central articles using 2,033 cross-section questions and two metric families: standard ranking metrics (MRR, Recall@k) and structural coverage metrics (SecCov@k, CS Recall). Our results expose a sharp divergence: content-similarity methods achieve the highest MRR (0.517) but always retrieve from a single section, while structure-aware methods retrieve from up to 15.6x more sections. Generation experiments show that KG-infused retrieval narrows the answer-quality gap to delta-F1 = 0.009 while maintaining 4.6x section diversity. These findings demonstrate that standard metrics systematically undervalue structural retrieval and that closing the multi-section synthesis gap is a key open problem for biomedical RAG.
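The divergence between the two metric families can be illustrated with a small sketch. The abstract names the metrics but does not define them, so the functions below are assumptions: `reciprocal_rank` and `recall_at_k` follow the standard textbook definitions, while `section_coverage_at_k` is a hypothetical stand-in for SecCov@k (the fraction of gold sections represented among the top-k retrieved chunks), which may differ from the paper's exact formulation. All identifiers and the toy data are illustrative only.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Standard reciprocal rank: 1 / position of the first relevant chunk, else 0."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Standard Recall@k: share of relevant chunks found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def section_coverage_at_k(
    ranked_sections: Sequence[str], gold_sections: Set[str], k: int
) -> float:
    """ASSUMED definition of a SecCov@k-style metric: share of gold sections
    that appear among the sections of the top-k retrieved chunks."""
    if not gold_sections:
        return 0.0
    return len(set(ranked_sections[:k]) & gold_sections) / len(gold_sections)


# Toy example (invented data): a run that ranks a relevant chunk first
# but draws every retrieved chunk from the same section.
ranked_ids = ["c1", "c2", "c3"]
ranked_sections = ["Results", "Results", "Results"]
relevant_ids = {"c1", "c7"}
gold_sections = {"Methods", "Results", "Discussion"}

rr = reciprocal_rank(ranked_ids, relevant_ids)
rec = recall_at_k(ranked_ids, relevant_ids, k=3)
cov = section_coverage_at_k(ranked_sections, gold_sections, k=3)
```

In this toy run the reciprocal rank is perfect while only one of three gold sections is covered, which is the kind of gap a ranking-only evaluation would not reveal.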

Paper Structure (47 sections, 9 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: GraLC-RAG framework architecture. Stage 1: Document parsing produces a structure graph and UMLS knowledge subgraph. Stage 2: Full-document transformer encoding. Stage 3: KG infusion via GAT attention. Stage 4: Structure-aware boundary detection and chunk embedding. Stage 5: Graph-guided hybrid retrieval.
  • Figure 2: Mean Reciprocal Rank across six chunking strategies on PubMedQA* (1,000 questions). Semantic chunking achieves the highest MRR. GraLC-RAG variants show slight degradation on short abstracts.
  • Figure 3: Recall@$k$ ($k \!\in\! \{1,3,5,10\}$) across all strategies. Performance converges at higher $k$, with all methods exceeding 0.99 at $k\!=\!10$.
  • Figure 4: Ablation: incremental effect of each GraLC-RAG component on MRR. The dashed line marks the late chunking baseline. Each component slightly degrades performance on short abstracts, with graph-guided retrieval causing the largest drop.
  • Figure 5: Indexing efficiency on 200 full-text PMC articles (CPU). Left axis: number of chunks produced. Right axis: indexing time. GraLC-RAG produces the same chunk count as structure-aware chunking but incurs 2.5$\times$ overhead from KG infusion.
  • ...and 2 more figures