Multiple Abstraction Level Retrieve Augment Generation

Zheng Zheng, Xinyi Ni, Pengyu Hong

TL;DR

The paper tackles the challenge that single-abstraction, fixed-size chunking in Retrieval-Augmented Generation (RAG) struggles to support questions spanning multiple levels of abstraction and is constrained by token limits. It introduces MAL-RAG, a framework that builds four hierarchical chunk levels (document, section, paragraph, and multi-sentence) via a map-reduce summarization pipeline and retains background detail at finer levels, enabling flexible, context-aware retrieval. A dynamic retrieval strategy with softmax-based chunk weighting and a length-threshold constraint selects relevant content, which is then used by Vicuna-13B-v1.3 with in-context learning to generate answers. On a Glycoscience dataset, MAL-RAG outperforms traditional single-level RAG, achieving a substantial improvement in Answer Correctness (up to 25.739% over single-level baselines) and providing an 800-question domain-specific benchmark to support future research. This work demonstrates a practical path toward domain-specific, explainable QA by leveraging document structure to balance information richness and noise.
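The selection step described above (softmax weighting plus a length threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the `Chunk` type, the `select_chunks` function, the word-count budget `max_words`, and the cumulative-weight cutoff `mass` are all hypothetical names and choices introduced here, and similarity scores are assumed to be precomputed.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    level: str    # "document", "section", "paragraph", or "multi-sentence"
    text: str
    score: float  # similarity to the question (assumed precomputed)


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_chunks(chunks, max_words=50, mass=0.9):
    """Pick chunks by descending softmax weight.

    Assumption (not from the paper): stop once the selected weights cover
    `mass` of the total, and skip any chunk that would push the context
    past `max_words` words.
    """
    if not chunks:
        return []
    weights = softmax([c.score for c in chunks])
    ranked = sorted(zip(chunks, weights), key=lambda cw: cw[1], reverse=True)
    selected, used_words, covered = [], 0, 0.0
    for chunk, weight in ranked:
        n_words = len(chunk.text.split())
        if used_words + n_words > max_words:
            continue  # length threshold: this chunk does not fit
        selected.append(chunk)
        used_words += n_words
        covered += weight
        if covered >= mass:
            break
    return selected


candidates = [
    Chunk("document", "a b c d", 0.9),
    Chunk("paragraph", "e f g h i j", 0.8),
    Chunk("multi-sentence", "k l", 0.1),
]
picked = select_chunks(candidates, max_words=6, mass=0.99)
print([c.level for c in picked])
```

With a six-word budget, the paragraph-level chunk is skipped because it would overflow the context, so a shorter chunk from another level fills the remaining space. Mixing levels in this way is the behavior the length constraint is meant to allow.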

Abstract

A Retrieval-Augmented Generation (RAG) model powered by a large language model (LLM) provides a faster and more cost-effective solution for adapting to new data and knowledge. It also delivers more specialized responses than pre-trained LLMs alone. However, most existing approaches rely on retrieving fixed-size chunks as references to support question-answering (Q/A). This approach typically addresses information needs at a single level of abstraction and struggles to generate answers that span multiple levels of abstraction. In a RAG setting, LLMs can summarize and answer questions effectively when provided with sufficient details, but retrieving excessive information often leads to the 'lost in the middle' problem and exceeds token limitations. We propose a novel RAG approach that uses chunks of multiple abstraction levels (MAL): multi-sentence-level, paragraph-level, section-level, and document-level. The effectiveness of our approach is demonstrated in the under-explored scientific domain of Glycoscience. Compared to traditional single-level RAG approaches, our approach improves AI-evaluated answer correctness of Q/A by 25.739% on Glyco-related papers.

Paper Structure

This paper contains 18 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison of Vanilla RAG (Left) and MAL-RAG (Right). In MAL-RAG, $D$, $S$, $P$, and $M$ indicate document-level chunks, section-level chunks, paragraph-level chunks, and multi-sentence-level chunks, respectively. Vanilla RAG, which uses fixed-length chunks, often encounters challenges such as the "lost in the middle" effect (Liu et al., 2024). In contrast, MAL-RAG mitigates this problem by utilizing higher-level chunks enriched with summary information.
  • Figure 2: MAL-RAG Pipeline. The MAL-RAG pipeline is composed of two primary stages: indexing and inference. In the indexing stage, articles are divided into multiple levels of granularity, such as document-level, section-level, paragraph-level, and multi-sentence-level text. A map-reduce approach is then used to extract key information from paragraph-level chunks, which are summarized into section-level chunks. These section-level chunks are further processed to generate document-level chunks in a similar manner. In the inference stage, a search engine retrieves relevant chunks based on similarity scores, which are computed using the Linq-Embed-Mistral open-source embedding model. These retrieved chunks, along with the input question and prompts, are fed into GPT-4o-mini to generate the final response.
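
The indexing stage in the Figure 2 caption can be illustrated with a small sketch. It assumes a document is already split into titled sections and paragraphs, and it replaces the LLM summarizer with a trivial placeholder (`first_sentence`) so the example runs offline. The function names, the `window` parameter, and the plain concatenation used in the reduce step are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List


def first_sentence(text: str) -> str:
    """Placeholder summarizer (assumption): keep only the first sentence.

    A real pipeline would call an LLM here.
    """
    return text.split(". ")[0].rstrip(".") + "."


def build_index(
    document: Dict[str, List[str]],
    summarize: Callable[[str], str] = first_sentence,
    window: int = 2,
) -> Dict[str, List[str]]:
    """Build chunks at four abstraction levels.

    `document` maps a section title to its list of paragraphs.
    """
    index: Dict[str, List[str]] = {
        "document": [], "section": [], "paragraph": [], "multi-sentence": [],
    }
    section_summaries = []
    for title, paragraphs in document.items():
        mapped = []
        for paragraph in paragraphs:
            index["paragraph"].append(paragraph)
            # Multi-sentence chunks: groups of `window` consecutive sentences.
            sentences = [s.strip().rstrip(".") + "."
                         for s in paragraph.split(". ") if s.strip()]
            for i in range(0, len(sentences), window):
                index["multi-sentence"].append(" ".join(sentences[i:i + window]))
            mapped.append(summarize(paragraph))  # map: one summary per paragraph
        # Reduce: merge paragraph summaries into one section-level chunk.
        section_chunk = f"{title}: " + " ".join(mapped)
        index["section"].append(section_chunk)
        section_summaries.append(summarize(section_chunk))
    # Repeat the same idea one level up to obtain the document-level chunk.
    index["document"].append(" ".join(section_summaries))
    return index


paper = {
    "Intro": ["Glycans are sugars. They coat cells. They matter."],
    "Methods": ["We ran assays. Results were stable."],
}
index = build_index(paper)
for level, chunks in index.items():
    print(level, len(chunks))
```

Each level is stored separately so that a retriever can score and mix chunks from any of them at inference time.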