A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

Abstract

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 ~0.459) and substantially better top-rank hit rates (Precision@1 ~24%, Hit@5 ~59%). In contrast, simple fixed-size character chunking baselines performed poorly (nDCG@5 < 0.244, Precision@1 ~2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.
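
The three metrics named in the abstract can be illustrated with a minimal sketch. The choices below are assumptions rather than the authors' stated implementation: relevance labels are taken to be integers 0, 1, 2, the DCG gain is the common exponential form 2^rel - 1, the ideal ranking is obtained by re-sorting the labels of the retrieved chunks, and a chunk counts as a "hit" when its label is at least 1. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(rels: Sequence[int], k: int = 5) -> float:
    """Discounted cumulative gain over the first k graded labels."""
    # Assumed gain: 2^rel - 1; assumed discount: log2(rank + 1), rank starting at 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels: Sequence[int], k: int = 5) -> float:
    """DCG normalised by the DCG of the same labels in the best possible order."""
    # Assumption: the ideal list is built from the retrieved labels only.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def hit_at_k(rels: Sequence[int], k: int = 5, threshold: int = 1) -> float:
    """1.0 if any of the first k chunks reaches the relevance threshold."""
    return 1.0 if any(r >= threshold for r in rels[:k]) else 0.0


def reciprocal_rank(rels: Sequence[int], threshold: int = 1) -> float:
    """1 / rank of the first chunk reaching the threshold, or 0.0 if none does."""
    for rank, r in enumerate(rels, start=1):
        if r >= threshold:
            return 1.0 / rank
    return 0.0


# Hypothetical judge labels for the top-5 chunks of one query.
labels = [0, 2, 1, 0, 0]
scores = {
    "nDCG@5": ndcg_at_k(labels),
    "Hit@5": hit_at_k(labels),
    "RR": reciprocal_rank(labels),
}
```

Averaging `reciprocal_rank` over all queries gives MRR, and averaging `ndcg_at_k` gives the mean nDCG@5 figures of the kind quoted above. A different gain function or a different definition of the ideal ranking would change the absolute values.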

Paper Structure (25 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: End-to-end experimental pipeline for evaluating document chunking strategies in dense retrieval. Documents from six knowledge domains (Biology, Physics, Health, Legal, Maths, Agriculture) in the UltraDomain dataset are segmented using 36 chunking strategies spanning six design categories. The resulting chunks are embedded using five dense embedding models and indexed into separate Qdrant vector stores, yielding 1,080 unique configurations. For each query, the top-5 retrieved chunks are evaluated by a Mixtral-8x22B LLM judge against the golden reference answer using a three-point graded relevance scale. The query is intentionally withheld from the evaluator to prevent lexical bias. Efficiency metrics, including index size, query latency, and memory usage, are recorded in parallel across all configurations. No generation component is included; evaluation is retrieval only.
  • Figure 2: Distribution of nDCG@5 for all chunking strategies across embedding models and domains, ordered by descending mean nDCG@5. Red diamonds denote mean values and black lines denote medians. Higher-ranked methods such as PGC, HPGC, and DFC show stronger overall retrieval effectiveness, while FC, FCC, and HFCF exhibit weaker performance. Differences in box and whisker widths highlight stability variations across queries. The chunking method abbreviations on the x-axis correspond to those defined in Table 1.
  • Figure 3: Mean nDCG@5 scores for domain-specific retrieval performance across Agriculture, Biology, Health, Legal, Maths and Physics. Dynamic Token Size Chunking (DFC) achieves the highest single score in Health and Physics, while paragraph-aware and late-stage strategies, particularly Paragraph Group Chunking (PGC), Hybrid Paragraph Group Fixed Token Chunking (HPGC), and Late Chunking Token Spans (LCTS), consistently rank among the top three methods across all domains. The chunking method abbreviations on the x-axis correspond to those defined in Table 1.
  • Figure 4: Heatmap of mean nDCG@5 across five embedding models and thirty-six chunking strategies. The results show that bge-m3 (0.456) delivers the strongest overall performance, with all-MiniLM-L6-v2 (0.416) as the clear second best. The potion-base variants trail behind, with scores generally tapering off across later chunking strategies. The chunking method abbreviations on the x-axis correspond to those defined in Table 1.
  • Figure 5: Effectiveness-efficiency trade-off plots. Left: mean nDCG@5 against index size, highlighting how each strategy scales in terms of index growth. Right: mean nDCG@5 against query latency, illustrating the trade-off between retrieval accuracy and response time for each chunking approach. The chunking method abbreviations on the x-axis correspond to those defined in Table 1.
  • ...and 1 more figure
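
The 1,080 configurations mentioned in the Figure 1 caption follow from crossing 36 strategies, five embedding models, and six domains (36 × 5 × 6 = 1,080). The sketch below enumerates that grid under stated assumptions: the domain names come from the caption, but the strategy identifiers, the three potion-base variant names, and the `collection_name` helper are hypothetical placeholders, since the full lists are not given here.

```python
from itertools import product

# Domain names as listed in the Figure 1 caption.
DOMAINS = ["Agriculture", "Biology", "Health", "Legal", "Maths", "Physics"]

# Two model names appear in the Figure 4 caption; the three potion-base
# entries are placeholders for the unnamed variants (assumption).
EMBEDDING_MODELS = [
    "bge-m3",
    "all-MiniLM-L6-v2",
    "potion-base-variant-1",
    "potion-base-variant-2",
    "potion-base-variant-3",
]

# Placeholder identifiers standing in for the 36 chunking strategies.
STRATEGIES = [f"strategy_{i:02d}" for i in range(1, 37)]

# One configuration per (strategy, model, domain) triple.
configs = [
    {"strategy": s, "model": m, "domain": d}
    for s, m, d in product(STRATEGIES, EMBEDDING_MODELS, DOMAINS)
]


def collection_name(cfg: dict) -> str:
    """Hypothetical naming scheme giving each configuration its own vector store."""
    return f"{cfg['domain'].lower()}__{cfg['model']}__{cfg['strategy']}"


names = {collection_name(cfg) for cfg in configs}
print(len(configs), len(names))
```

Keeping one index per triple, as the caption describes, means retrieval scores and efficiency measurements can be attributed to a single strategy, model, and domain without cross-contamination between configurations.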