Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation

Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava

TL;DR

The paper tackles how document chunking impacts Retrieval-Augmented Generation (RAG), introducing two domain-trained semantic chunkers, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC). Trained on augmented PubMed data and evaluated with full-text PubMed Central articles, PSC and MFC improve retrieval (MRR and Hits@k) and generation quality across in-domain and out-of-domain datasets, with PSC notably fast and MFC+E5 delivering strong generation metrics. The study provides a unified, end-to-end evaluation framework combining augmented datasets (PubMedQA) and RAGBench, and demonstrates the practical viability and generalizability of semantic chunking for RAG pipelines. The authors also highlight that traditional lexical metrics may not fully capture semantic quality, advocating for boundary-aware chunking as a key factor in retrieval-aware generation.

Abstract

Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response-time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
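
The retrieval metrics named in the abstract, MRR and Hits@k, can be illustrated with a minimal sketch. It assumes each query comes with a ranked list of retrieved chunk IDs and a set of relevant chunk IDs. The function names and toy data below are hypothetical illustrations, not the authors' evaluation code.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    all_ranked: Sequence[Sequence[str]], all_relevant: Sequence[Set[str]]
) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)


def hits_at_k(
    all_ranked: Sequence[Sequence[str]], all_relevant: Sequence[Set[str]], k: int
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    hits = [
        any(chunk_id in rel for chunk_id in ranked[:k])
        for ranked, rel in zip(all_ranked, all_relevant)
    ]
    return sum(hits) / len(hits)


# Toy data (hypothetical): three queries, each with a ranked list of chunk IDs.
ranked = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
relevant = [{"c1"}, {"c2"}, {"c0"}]

mrr = mean_reciprocal_rank(ranked, relevant)
hits1 = hits_at_k(ranked, relevant, k=1)
hits3 = hits_at_k(ranked, relevant, k=3)
```

In this toy example the first relevant chunk appears at rank 2 for the first query, at rank 1 for the second, and not at all for the third, so MRR is (1/2 + 1 + 0) / 3 = 0.5, Hits@1 is 1/3, and Hits@3 is 2/3. Chunk boundaries change which chunk IDs exist and where they rank, which is why both metrics are sensitive to the chunking method.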

Paper Structure

This paper contains 12 sections, 5 tables.