HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

TL;DR

HiChunk targets the evaluation gap in Retrieval-Augmented Generation (RAG) caused by evidence sparsity in existing chunking benchmarks. It introduces HiCBench, a hierarchical document QA benchmark with annotated chunking points, evidence-dense QA pairs, and explicit evidence sources, and HiChunk, a framework for multi-level document structuring coupled with an Auto-Merge retrieval algorithm. The approach enables explicit assessment of chunking effects across the chunker, retriever, and responder components and demonstrates improvements in chunking accuracy and end-to-end RAG performance across multiple datasets. The work offers a practical path toward more reliable long-document retrieval and improved knowledge integration in RAG systems.

Abstract

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of a RAG system, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality at a reasonable time cost, thereby enhancing the overall performance of RAG systems.

Paper Structure

This paper contains 23 sections, 5 figures, 6 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Different chunking methods produce the same answer.
  • Figure 2: Framework. (a) Iterative inference for HiChunk on long documents. (b) Auto-Merge retrieval algorithm.
  • Figure 3: Performance on HiCBench ($T_1$) under different retrieval token budgets from 2k to 4k.
  • Figure 4: Evidence recall metric across different maximum levels on HiCBench ($T_1$ and $T_2$).
  • Figure A1: Evidence recall metric across different token budgets on HiCBench ($T_1$).