Intent-Driven Dynamic Chunking: Segmenting Documents to Reflect Predicted Information Needs

Christos Koutsiaris

TL;DR

Intent-Driven Dynamic Chunking (IDC) is a novel approach that uses predicted user queries to guide document segmentation. Aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.

Abstract

Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well systems can locate and return relevant information. However, traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent, leading to chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), a novel approach that uses predicted user queries to guide document segmentation. IDC leverages a Large Language Model to generate likely user intents for a document and then employs a dynamic programming algorithm to find the globally optimal chunk boundaries. This represents a novel application of DP to intent-aware segmentation that avoids greedy pitfalls. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. Additionally, IDC produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.
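
The two-stage idea described in the abstract (predict intents, then choose chunk boundaries with dynamic programming) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the intents are hard-coded rather than generated by a Large Language Model, and the chunk score is a hypothetical word-overlap F1 between a chunk and its best-matching intent, weighted by chunk length. The helper names `make_scorer` and `segment` are likewise hypothetical. Only the overall shape, a DP over sentence boundaries that maximizes a total intent-alignment score, follows the description above.

```python
from typing import Callable, List, Set, Tuple


def words(text: str) -> Set[str]:
    """Lowercased bag of unique words (toy tokenizer, an assumption)."""
    return set(text.lower().split())


def make_scorer(sentences: List[str], intents: List[str]) -> Callable[[int, int], float]:
    """Build a hypothetical chunk scorer: overlap F1 with the best intent, times chunk length."""
    sent_words = [words(s) for s in sentences]
    intent_words = [words(q) for q in intents]

    def score(i: int, j: int) -> float:
        chunk = set().union(*sent_words[i:j])  # words in sentences i..j-1
        best = max(
            (2 * len(chunk & q) / (len(chunk) + len(q)) for q in intent_words),
            default=0.0,
        )
        return (j - i) * best

    return score


def segment(n: int, score: Callable[[int, int], float], max_len: int) -> List[Tuple[int, int]]:
    """Return chunk boundaries (start, end) maximizing the summed chunk score."""
    best = [float("-inf")] * (n + 1)  # best[j]: best total score for sentences 0..j-1
    back = [0] * (n + 1)              # back[j]: start index of the last chunk ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            candidate = best[i] + score(i, j)
            if candidate > best[j]:
                best[j], back[j] = candidate, i
    # Walk the back-pointers from the end to recover the boundaries.
    bounds = []
    j = n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return bounds[::-1]


sentences = [
    "install the package with pip",
    "pip needs python three",
    "the license is mit",
    "mit allows commercial use",
]
# Stand-ins for LLM-predicted intents (hard-coded assumption).
intents = ["how to install with pip python", "what license mit commercial"]

boundaries = segment(len(sentences), make_scorer(sentences, intents), max_len=4)
print(boundaries)  # [(0, 2), (2, 4)]
```

In this toy input the DP groups the two installation sentences and the two licensing sentences, because each pair matches one intent better than any single sentence or any larger span does. Unlike a greedy pass, the DP considers every boundary placement up to `max_len` sentences per chunk, which is what allows it to return a globally optimal segmentation under the chosen score.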

Paper Structure (19 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Recall@1 across datasets. IDC (red) consistently matches or exceeds the best baseline (gray), with largest gains on long documents (arXiv +67%, Fiori +60%).
  • Figure 2: Complete retrieval metrics (R@1, R@5, MRR) across all datasets and methods. IDC achieves the highest or tied-highest scores on 5 of 6 datasets.
  • Figure 3: Number of chunks produced by IDC vs. baselines. IDC generates 40-60% fewer chunks while achieving higher retrieval performance.
  • Figure 4: Answer coverage: percentage of questions whose answer is fully contained within a single chunk. IDC achieves 93-100% coverage, compared to 80-87% for baselines.
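
The metrics named in the captions above can be made concrete with a short sketch. The definitions of Recall@k and MRR are the standard ones; the answer-coverage function follows the wording of the Figure 4 caption but assumes exact substring matching, which may differ from how the paper matches answers to chunks. The function names and the toy data are illustrative only and do not reproduce any reported result.

```python
from typing import List


def recall_at_k(ranked_hits: List[List[bool]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    return sum(any(hits[:k]) for hits in ranked_hits) / len(ranked_hits)


def mean_reciprocal_rank(ranked_hits: List[List[bool]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(ranked_hits)


def answer_coverage(chunks: List[str], answers: List[str]) -> float:
    """Percentage of answers fully contained in a single chunk (substring match, an assumption)."""
    covered = sum(any(answer in chunk for chunk in chunks) for answer in answers)
    return 100.0 * covered / len(answers)


# Toy relevance judgments: one list of ranked results per query.
hits = [[True, False], [False, True], [False, False]]
print(recall_at_k(hits, 1), recall_at_k(hits, 5), mean_reciprocal_rank(hits))

# The second answer is split across two chunks, so only one of two answers is covered.
chunks = ["Paris is the capital of France.", "Berlin is the capital", "of Germany."]
answers = ["capital of France", "capital of Germany"]
print(answer_coverage(chunks, answers))  # 50.0
```

The second example shows the failure mode that coverage is meant to expose: a boundary that falls inside an answer leaves no single chunk containing it, even though the full text is still present in the index.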