Intent-Driven Dynamic Chunking: Segmenting Documents to Reflect Predicted Information Needs

Christos Koutsiaris

TL;DR

Intent-Driven Dynamic Chunking (IDC) is a novel approach that uses predicted user queries to guide document segmentation. Aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.

Abstract

Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well systems can locate and return relevant information. However, traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent, leading to chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), a novel approach that uses predicted user queries to guide document segmentation. IDC leverages a Large Language Model to generate likely user intents for a document and then employs a dynamic programming algorithm to find the globally optimal chunk boundaries. This represents a novel application of DP to intent-aware segmentation that avoids greedy pitfalls. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. Additionally, IDC produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.
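
To make the boundary-optimization idea concrete, the following is a minimal sketch of dynamic programming over sentence boundaries. It is not the paper's implementation: the scoring function `chunk_score`, the length weighting, the per-chunk `penalty`, and the `max_len` limit are all assumptions chosen for illustration, and the toy vectors stand in for real sentence and intent embeddings. The sketch only shows how a DP can pick a globally best set of boundaries instead of committing to greedy local decisions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def chunk_score(sent_vecs, intent_vecs, i, j, penalty):
    """Hypothetical score for a chunk covering sentences i..j-1.

    Assumption: a chunk is good if its centroid is close to at least one
    predicted intent; the score is weighted by chunk length and reduced by
    a fixed per-chunk penalty to discourage over-fragmentation.
    """
    centroid = mean_vector(sent_vecs[i:j])
    best_match = max(cosine(centroid, q) for q in intent_vecs)
    return (j - i) * best_match - penalty


def segment(sent_vecs, intent_vecs, max_len=4, penalty=0.1):
    """Return chunk spans (start, end) maximizing the total chunk score."""
    n = len(sent_vecs)
    best = [float("-inf")] * (n + 1)  # best[j]: best total score for sentences 0..j-1
    back = [0] * (n + 1)              # back[j]: start index of the last chunk ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = best[i] + chunk_score(sent_vecs, intent_vecs, i, j, penalty)
            if cand > best[j]:
                best[j], back[j] = cand, i
    # Walk the back-pointers from the end of the document to recover the spans.
    chunks = []
    j = n
    while j > 0:
        chunks.append((back[j], j))
        j = back[j]
    return chunks[::-1]


# Toy input: two sentences on one topic, then two on another, with one intent per topic.
sentences = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
intents = [[1.0, 0.0], [0.0, 1.0]]
print(segment(sentences, intents))
```

On this toy input the sketch groups the first two and the last two sentences, since each pair matches one intent exactly. With real embeddings the result depends heavily on the assumed score and penalty.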

Paper Structure (19 sections, 3 equations, 4 figures, 2 tables)

Introduction
Related Work
Document Segmentation Methods
Query-Aware Document Expansion
Methodology
Overview of IDC
Intent Simulation
Sentence Embedding and Scoring
Boundary Optimization
Experimental Setup
Datasets
Baselines
Evaluation Metrics
Results
Retrieval Performance
...and 4 more sections

Figures (4)

Figure 1: Recall@1 across datasets. IDC (red) consistently matches or exceeds the best baseline (gray), with largest gains on long documents (arXiv +67%, Fiori +60%).
Figure 2: Complete retrieval metrics (R@1, R@5, MRR) across all datasets and methods. IDC achieves the highest or tied-highest scores on 5 of 6 datasets.
Figure 3: Number of chunks produced by IDC vs baselines. IDC generates 40-60% fewer chunks while achieving higher retrieval performance.
Figure 4: Answer coverage: percentage of questions whose answer is fully contained within a single chunk. IDC achieves 93-100% coverage, compared to 80-87% for baselines.
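
The metrics named in the captions above can be sketched as follows. Recall@k and MRR follow their standard definitions. The answer-coverage function assumes that "fully contained" can be approximated by exact substring matching, which may differ from the paper's matching procedure. The function names and input layouts are hypothetical.

```python
def answer_coverage(chunks_per_doc, qa_pairs):
    """Percentage of questions whose answer text lies inside a single chunk.

    chunks_per_doc: dict mapping a document id to its list of chunk strings.
    qa_pairs: list of (doc_id, answer_text) tuples.
    Assumption: exact substring match stands in for "fully contained".
    """
    covered = sum(
        any(answer in chunk for chunk in chunks_per_doc[doc_id])
        for doc_id, answer in qa_pairs
    )
    return 100.0 * covered / len(qa_pairs)


def recall_at_k(ranks, k):
    """Fraction of questions whose first relevant chunk is ranked within the top k.

    ranks: 1-based rank of the first relevant chunk per question,
    or None when no relevant chunk was retrieved.
    """
    return sum(r is not None and r <= k for r in ranks) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; questions with no relevant chunk contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy data: the second answer straddles a chunk boundary, so it is not covered.
chunks = {"d1": ["Paris is the capital of France.", "It lies on the Seine."]}
qa = [("d1", "capital of France"), ("d1", "France. It lies")]
ranks = [1, 3, None, 2]
print(answer_coverage(chunks, qa), recall_at_k(ranks, 1), recall_at_k(ranks, 5), mrr(ranks))
```

The second toy answer illustrates why coverage matters: a boundary placed in the middle of an answer leaves no single chunk that contains it.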
