Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

Jihao Zhao; Zhiyuan Ji; Yuchen Feng; Pengnian Qi; Simin Niu; Bo Tang; Feiyu Xiong; Zhiyu Li

TL;DR

This work tackles the chunking bottleneck in Retrieval-Augmented Generation by introducing Meta-Chunking, a framework that uses uncertainty-aware boundaries (Perplexity Chunking and Margin Sampling Chunking) and dynamic merging to create logically coherent chunks. It augments chunk integrity with a semantic completion pipeline comprising globally augmented rewriting and context-aware summarization, supported by large-scale synthetic training data for fine-tuning lightweight models. The approach improves chunk quality and retrieval coherence across five QA datasets, and demonstrates that high-quality chunking can be achieved with smaller models, reducing dependency on instruction-following capabilities while enhancing practical deployment. Overall, Meta-Chunking advances RAG by aligning chunk boundaries with logical structure and by restoring global information, enabling more reliable knowledge-intensive reasoning in LLM applications.

Abstract

While Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for boosting large language models (LLMs) in knowledge-intensive tasks, it often overlooks the crucial role of text chunking within its workflow. This paper proposes the Meta-Chunking framework, which enhances chunking quality through a dual strategy: identifying optimal segmentation points and preserving global information. First, to overcome the limitations of similarity-based chunking, we design two adaptive, uncertainty-based chunking techniques, Perplexity Chunking and Margin Sampling Chunking, that draw on the logical perception capabilities of LLMs. Because texts differ in inherent complexity, we combine meta-chunks with dynamic merging to strike a balance between fine-grained and coarse-grained chunking. Furthermore, we establish a global information compensation mechanism comprising a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure focused on missing reflection, refinement, and completion. Together, these components strengthen the semantic integrity and contextual coherence of chunks. Extensive experiments demonstrate that Meta-Chunking effectively addresses the challenges of chunking within RAG systems, providing LLMs with more logically coherent text chunks. Additionally, our methodology validates the feasibility of performing high-quality chunking with smaller-scale models, thereby eliminating the reliance on robust instruction-following capabilities.
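To make the idea of uncertainty-based boundaries more concrete, the following is a minimal sketch of one plausible reading of Perplexity Chunking. The abstract does not state the boundary rule, so everything here is an assumption for illustration: sentences whose perplexity is a local minimum are treated as chunk boundaries, the cut is placed after that sentence, and the names `perplexity`, `boundary_indices`, `split_at`, and the `threshold` parameter are hypothetical. In practice the per-token log-probabilities would come from a language model; here they are supplied as plain numbers.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sentence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def boundary_indices(ppl: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Indices of sentences whose perplexity is a local minimum.

    Assumption (not from the source): sentence i is a boundary when it sits
    more than `threshold` below both of its neighbours.
    """
    return [
        i
        for i in range(1, len(ppl) - 1)
        if min(ppl[i - 1], ppl[i + 1]) - ppl[i] > threshold
    ]


def split_at(sentences: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Close a chunk after each boundary sentence (an illustrative choice)."""
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(list(sentences[start : b + 1]))
        start = b + 1
    if start < len(sentences):
        chunks.append(list(sentences[start:]))
    return chunks


if __name__ == "__main__":
    # Made-up sentence perplexities, for illustration only.
    ppl_values = [5.0, 2.0, 6.0, 7.0, 3.0, 8.0]
    sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
    cuts = boundary_indices(ppl_values, threshold=1.0)
    print(split_at(sentences, cuts))
```

A larger `threshold` yields fewer, coarser chunks, which is one simple way to trade off fine-grained against coarse-grained segmentation; the dynamic merging described in the abstract is not modeled in this sketch.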
