LumberChunker: Long-Form Narrative Document Segmentation

André V. Duarte; João Marques; Miguel Graça; Miguel Freire; Lei Li; Arlindo L. Oliveira

TL;DR

This work proposes LumberChunker, a method that uses an LLM to segment documents dynamically: it iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. LumberChunker proves more effective than other chunking methods and competitive baselines.

Abstract

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size, so that the semantic independence of the content is better captured. We propose LumberChunker, a method that uses an LLM to segment documents dynamically by iteratively prompting the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20). Moreover, when integrated into a RAG pipeline, LumberChunker proves more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our code and data are available at https://github.com/joaodsmarques/LumberChunker
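
The DCG@20 metric quoted in the abstract can be made concrete with a short sketch. It uses the standard discounted cumulative gain definition, $\sum_{i=1}^{k} rel_i / \log_2(i+1)$. Treating relevance as binary (1 if a retrieved chunk contains the answer, 0 otherwise) is an assumption made here for illustration, and `dcg_at_k` is a hypothetical name, not a function from the authors' repository.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k entries of a ranked list.

    relevances[i] is the relevance of the result at rank i + 1.
    Assumption: binary relevance (1 = chunk contains the answer, 0 = it does not).
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


# The answer-bearing chunk is retrieved at rank 1: no discount applies.
print(dcg_at_k([1, 0, 0], k=20))            # 1.0
# The same chunk at rank 2 is discounted by log2(3).
print(round(dcg_at_k([0, 1, 0], k=20), 4))  # 0.6309
```

With a single relevant chunk per question, the score depends only on the rank at which that chunk is retrieved, and it is zero whenever the chunk falls outside the top $k$.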

Paper Structure (22 sections, 4 figures, 10 tables)

Introduction
Background
Methodology
LumberChunker
GutenQA
Experiments
Results and Discussion
Context Size
Main Results
Impact on QA Systems
Conclusions
Limitations
Ethical Considerations
Propositions Example on Narrative Texts
LumberChunker Gemini Prompt
...and 7 more sections

Figures (4)

Figure 1: LumberChunker follows a three-step process. First, we segment a document paragraph-wise. Secondly, a group ($G_i$) is created by appending sequential chunks until exceeding a predefined token count $\theta$. Finally, $G_i$ is fed as context to Gemini, which determines the ID where a significant content shift starts to appear, thus defining the start of $G_{i+1}$ and the end of the current chunk. This process is cyclically repeated for the entire document.
Figure 2: Optimizing Context Size $\theta$ ($\approx$ number of tokens in the LumberChunker prompt).
Figure 3: QA Accuracy on Autobiographies Test Set.
Figure 4: RAG Pipeline for QA on Autobiographies
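
The three-step loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `find_shift` is a hypothetical callable standing in for the Gemini prompt, `whitespace_tokens` is a crude substitute for a real tokenizer, and the handling of the final group and of oversized paragraphs is a simplification chosen here.

```python
from typing import Callable, List, Sequence


def whitespace_tokens(text: str) -> int:
    """Crude token count (whitespace split); a stand-in for a real tokenizer."""
    return len(text.split())


def lumber_chunk(
    paragraphs: Sequence[str],
    find_shift: Callable[[List[int], List[str]], int],
    theta: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> List[List[str]]:
    """Group paragraphs into chunks, cutting wherever find_shift reports a content shift.

    paragraphs: the document already split paragraph-wise (step 1).
    find_shift: hypothetical stand-in for the LLM call; it receives the paragraph
        IDs and texts of the group G_i and returns the ID where the content shifts.
    theta: token threshold that a group must exceed before the model is asked.
    """
    chunks: List[List[str]] = []
    start, n = 0, len(paragraphs)
    while start < n:
        # Step 2: build G_i by appending paragraphs until the count exceeds theta.
        end, tokens = start, 0
        while end < n and tokens <= theta:
            tokens += count_tokens(paragraphs[end])
            end += 1
        if tokens <= theta or end - start == 1:
            # Assumption: a tail that fits within theta, or a single oversized
            # paragraph, becomes a chunk without a model call.
            chunks.append(list(paragraphs[start:end]))
            start = end
            continue
        # Step 3: ask the model where the content shifts inside G_i.
        ids = list(range(start, end))
        shift = find_shift(ids, list(paragraphs[start:end]))
        # Clamp so each chunk is non-empty and the loop always advances.
        shift = min(max(shift, start + 1), end - 1)
        chunks.append(list(paragraphs[start:shift]))
        start = shift  # the shift ID is the first paragraph of G_{i+1}
    return chunks


# Usage with a deterministic stub that always picks the last ID in the group.
doc = ["a b c", "d e f", "g h i", "j k", "l"]
print(lumber_chunk(doc, lambda ids, texts: ids[-1], theta=5))
# [['a b c'], ['d e f'], ['g h i', 'j k'], ['l']]
```

Passing the model call in as a function keeps the grouping logic testable without network access; a real `find_shift` would format the group into a prompt and parse the returned ID.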
