MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

TL;DR

The paper addresses the bottleneck of text chunking in retrieval-augmented generation by introducing Boundary Clarity and Chunk Stickiness as direct chunking-quality metrics and proposing the Mixture-of-Chunkers (MoC) framework with a multi-granularity router and regex-based chunk extraction. It demonstrates that LLM-based chunking yields clearer boundaries and looser chunk stickiness, improving retrieval and answer quality over traditional chunking methods. Through sparse activation and post-processing, the MoC design achieves comparable or better performance at roughly the computational cost of a single SLM, validated on multiple QA datasets and LMs. These contributions offer a scalable, efficient approach to chunk-aware RAG and provide practical tools for chunking evaluation and deployment.

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
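
The regex-based extraction step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each regular expression emitted by the chunker matches one chunk from its opening words to its closing words. The helper names `build_pattern` and `extract_chunks`, the sample text, and the choice to keep unmatched text as separate chunks are all hypothetical.

```python
import re
from typing import List


def build_pattern(start: str, end: str) -> str:
    """Hypothetical helper: build a regex spanning from a chunk's opening
    words to its closing words (assumed pattern shape, lazy match)."""
    return re.escape(start) + r"[\s\S]*?" + re.escape(end)


def extract_chunks(text: str, patterns: List[str]) -> List[str]:
    """Apply the patterns in order, scanning forward through the text.

    Text that no pattern covers is kept as its own chunk so nothing is lost
    (a design choice of this sketch, not something the source specifies).
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            continue  # skip a pattern that does not match the remaining text
        gap = text[pos:match.start()].strip()
        if gap:
            chunks.append(gap)
        chunks.append(match.group(0))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        chunks.append(tail)
    return chunks


# Illustrative input; in the paper's setting the patterns would come from the chunker.
text = (
    "RAG pairs a retriever with a generator. "
    "Chunking decides what the retriever sees. "
    "MoC routes each document to a chunker. "
    "The chunker emits regular expressions."
)
patterns = [
    build_pattern("RAG pairs", "retriever sees."),
    build_pattern("MoC routes", "regular expressions."),
]
chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(repr(chunk))
```

Because the chunks are cut out of the original text rather than regenerated, a model under this scheme only needs to output short patterns instead of the full chunk contents. This is consistent with the efficiency goal the abstract describes.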

Paper Structure

This paper contains 30 sections, 9 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overview of the entire process of granularity-aware MoC: Dataset construction, training of router and meta-chunkers, as well as chunking inference.
  • Figure 2: Performance sensitivity to temperature and top-k.
  • Figure 3: Score distribution of attention heads before fine-tuning.
  • Figure 4: Score distribution of attention heads after fine-tuning.
  • Figure 5: Granularity distribution of text chunks generated by GPT-4o on the CRUD benchmark.
  • ...and 6 more figures