Beyond Chunk-Then-Embed: A Comprehensive Taxonomy and Evaluation of Document Chunking Strategies for Information Retrieval
Yongjie Zhou, Shuai Wang, Bevan Koopman, Guido Zuccon
TL;DR
This work systematically unifies document chunking strategies for dense retrieval by framing them along two axes (segmentation methods and embedding-chunking ordering) and evaluating them across in-corpus and in-document retrieval. It reproduces prior results for LumberChunker and Late Chunking and benchmarks a wide set of segmentation methods, embedding models, and datasets (BEIR and GutenQA). Key findings show task-dependent preferences: simple structure-based chunking excels for in-corpus retrieval, while LumberChunker performs best for in-document retrieval; contextualized chunking helps in-corpus retrieval with LLM-guided methods but degrades in-document retrieval, and chunk size is not the sole driver of performance. The paper provides practical guidelines for practitioners and releases code to facilitate reproducibility and further research.
Abstract
Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies (e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chunking). Our reproduction evaluates these approaches in two distinct retrieval settings established in previous work: in-document retrieval (needle-in-a-haystack) and in-corpus retrieval (the standard information retrieval task). Our comprehensive evaluation reveals that optimal chunking strategies are task-dependent: simple structure-based methods outperform LLM-guided alternatives for in-corpus retrieval, while LumberChunker performs best for in-document retrieval. Contextualized chunking improves in-corpus effectiveness but degrades in-document retrieval. We also find that chunk size correlates moderately with in-document effectiveness but only weakly with in-corpus effectiveness, suggesting that differences between segmentation methods are not purely driven by chunk size. Our code and evaluation benchmarks are publicly available at (anonymized).
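
To make the second dimension concrete, the following is a minimal sketch of the two embedding paradigms, not the implementation evaluated in the paper. It assumes a toy encoder (`encode_tokens`, a hypothetical stand-in for a real long-context embedding model) in which each token vector is mixed with the mean of whatever input window it is encoded in, mean pooling over tokens, and fixed-size segmentation. All function names and the `DIM` constant are illustrative assumptions.

```python
import hashlib

DIM = 8  # toy embedding width (assumption, not from the paper)

def static_vector(token):
    """Deterministic pseudo-embedding for one token (stand-in for a real model)."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def encode_tokens(tokens):
    """Toy 'contextual' encoder: mix each token vector with the mean of its input window."""
    static = [static_vector(t) for t in tokens]
    context = mean(static)
    return [[0.5 * s + 0.5 * c for s, c in zip(vec, context)] for vec in static]

def fixed_size_spans(n_tokens, size):
    """Structure-based segmentation: consecutive spans of at most `size` tokens."""
    return [(start, min(start + size, n_tokens)) for start in range(0, n_tokens, size)]

def pre_embedding_chunking(tokens, spans):
    """Segment first, then embed each chunk in isolation and pool."""
    return [mean(encode_tokens(tokens[a:b])) for a, b in spans]

def contextualized_chunking(tokens, spans):
    """Embed the full document once, then pool the token vectors inside each span."""
    token_vectors = encode_tokens(tokens)
    return [mean(token_vectors[a:b]) for a, b in spans]

doc = "the whale surfaced near the ship and the crew watched it dive again".split()
spans = fixed_size_spans(len(doc), size=5)
pre = pre_embedding_chunking(doc, spans)
late = contextualized_chunking(doc, spans)
```

Both functions return one vector per span, so the two paradigms can be swapped behind the same retrieval index. In this toy setup the only difference is the window that supplies context: the chunk itself in the pre-embedding case, the whole document in the contextualized case. When a document fits in a single span, the two coincide.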
