Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

Jihao Zhao, Zhiyuan Ji, Yuchen Feng, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

TL;DR

This work tackles the chunking bottleneck in Retrieval-Augmented Generation by introducing Meta-Chunking, a framework that uses uncertainty-aware boundaries (Perplexity Chunking and Margin Sampling Chunking) and dynamic merging to create logically coherent chunks. It augments chunk integrity with a semantic completion pipeline comprising globally augmented rewriting and context-aware summarization, supported by large-scale synthetic training data for fine-tuning lightweight models. The approach improves chunk quality and retrieval coherence across five QA datasets, and demonstrates that high-quality chunking can be achieved with smaller models, reducing dependency on instruction-following capabilities while enhancing practical deployment. Overall, Meta-Chunking advances RAG by aligning chunk boundaries with logical structure and by restoring global information, enabling more reliable knowledge-intensive reasoning in LLM applications.

Abstract

While Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for boosting large language models (LLMs) in knowledge-intensive tasks, it often overlooks the crucial role of text chunking within its workflow. This paper proposes the Meta-Chunking framework, which enhances chunking quality through a dual strategy that identifies optimal segmentation points and preserves global information. First, to overcome the limitations of similarity-based chunking, we design two adaptive, uncertainty-based chunking techniques, Perplexity Chunking and Margin Sampling Chunking, that draw on the logical perception capabilities of LLMs. Because texts vary in complexity, we combine meta-chunks with dynamic merging, striking a balance between fine-grained and coarse-grained chunking. Furthermore, we establish a global information compensation mechanism, comprising a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure that covers missing reflection, refinement, and completion. Together, these components strengthen the semantic integrity and contextual coherence of chunks. Extensive experiments demonstrate that Meta-Chunking effectively addresses the challenges of chunking within the RAG system, providing LLMs with more logically coherent text chunks. Additionally, our methodology validates the feasibility of performing high-quality chunking with smaller-scale models, thereby eliminating the reliance on robust instruction-following capabilities.
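
The abstract names perplexity-based segmentation and dynamic merging of meta-chunks but does not give the exact rules, so the following is only a minimal sketch under stated assumptions. It assumes per-sentence perplexity scores have already been computed by some language model, that a boundary is placed after a sentence whose perplexity is a local minimum by a margin `theta`, and that merging is greedy under a character budget `max_chars`. The function names, `theta`, and `max_chars` are hypothetical and are not taken from the paper.

```python
from typing import List


def ppl_boundaries(ppl: List[float], theta: float = 0.0) -> List[int]:
    """Return indices of sentences that are local perplexity minima.

    Assumption: sentence i closes a meta-chunk when its perplexity is lower
    than that of both neighbours by more than `theta`.
    """
    cuts = []
    for i in range(1, len(ppl) - 1):
        if ppl[i - 1] - ppl[i] > theta and ppl[i + 1] - ppl[i] > theta:
            cuts.append(i)
    return cuts


def split_meta_chunks(sentences: List[str], cuts: List[int]) -> List[List[str]]:
    """Split the sentence list into meta-chunks, cutting after each index in `cuts`."""
    chunks, start = [], 0
    for i in cuts:
        chunks.append(sentences[start:i + 1])
        start = i + 1
    if start < len(sentences):
        chunks.append(sentences[start:])
    return chunks


def merge_meta_chunks(meta_chunks: List[List[str]], max_chars: int) -> List[str]:
    """Greedily join adjacent meta-chunks while the result fits in `max_chars`.

    A single meta-chunk longer than `max_chars` is kept whole rather than split.
    """
    merged, current = [], ""
    for meta_chunk in meta_chunks:
        text = " ".join(meta_chunk)
        if current and len(current) + 1 + len(text) > max_chars:
            merged.append(current)
            current = text
        else:
            current = f"{current} {text}".strip()
    if current:
        merged.append(current)
    return merged


if __name__ == "__main__":
    sentences = ["A.", "B.", "C.", "D.", "E.", "F."]
    # Hypothetical perplexity scores, one per sentence.
    scores = [5.0, 2.0, 6.0, 4.0, 1.0, 3.0]
    cuts = ppl_boundaries(scores)
    meta = split_meta_chunks(sentences, cuts)
    print(merge_meta_chunks(meta, max_chars=12))
```

Keeping boundary detection separate from merging mirrors the balance the abstract describes: the perplexity step decides where the text may be cut, while the length budget decides how coarse the final chunks are.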

Paper Structure

This paper contains 34 sections, 14 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview of the entire process of Meta-Chunking. Each circle represents a complete sentence, and the sentence lengths are not consistent. The vertical lines indicate where to segment. Circles with the same background color represent a meta-chunk, which is dynamically combined to make the final chunk length meet user needs.
  • Figure 2: Performance comparison of MSP (Margin Sampling) Chunking using two types of prompts across LLMs of different sizes.
  • Figure 3: Overview of the RAG pipeline, with examples of rule-based, similarity-based, and PPL (Perplexity) Chunking. Text with the same background color belongs to the same chunk.
  • Figure 4: Examples of PPL value variations and semantic similarity for sentences with different logical relationships, where $x \supset y$, $x \mid y$, $x \rightarrow y$, and $x := y$ refer to general-specific, parallel, sequential, and illustrative relationships, respectively.
  • Figure 5: Trends in PPL distribution variations between original and rewritten text chunks across different LLMs.
  • ...and 3 more figures

Theorems & Definitions (2)

  • proof
  • proof