A MapReduce Approach to Effectively Utilize Long Context Information in Retrieval Augmented Language Models

Gongbo Zhang, Zihan Xu, Qiao Jin, Fangyi Chen, Yilu Fang, Yi Liu, Justin F. Rousseau, Ziyang Xu, Zhiyong Lu, Chunhua Weng, Yifan Peng

TL;DR

The paper tackles safety and accuracy gaps in healthcare LLMs by addressing the lost-in-the-middle problem in retrieval-augmented generation. It introduces BriefContext, a map-reduce workflow that splits long-context reasoning into parallel short-context subtasks via four modules (Retrieval, Preflight check, ContextMap, ContextReduce) to boost robustness without altering model weights. Through controlled experiments and integration tests across multiple LLM backbones and medical QA datasets, BriefContext demonstrates improved QA accuracy, better conflict resolution, and meaningful cost management via a preflight predictor. The approach offers practical implications for deploying healthcare LLMs with greater reliability and opens avenues for applying long-context processing to other domains requiring precise information extraction from large corpora.

Abstract

While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses depends heavily on the rank and density of key information in the retrieval results, as the "lost-in-the-middle" problem illustrates. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the "lost-in-the-middle" issue without modifying the model weights. We demonstrate the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains.
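The map-reduce workflow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `partition`, `context_map`, `context_reduce`, and `brief_context`, the prompt wording, the `partition_size` parameter, and the boolean `needs_map_reduce` stand-in for the preflight check are all assumptions made for illustration. The LLM is modeled as any callable from a prompt string to a response string, and retrieval is assumed to have already produced the ranked document list.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical interfaces: an LLM is any function prompt -> response, and the
# preflight check is any function (question, docs) -> bool.
LLM = Callable[[str], str]
Preflight = Callable[[str, Sequence[str]], bool]


def partition(docs: Sequence[str], size: int) -> List[List[str]]:
    """Split the retrieved documents into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(docs[i:i + size]) for i in range(0, len(docs), size)]


def rag_prompt(question: str, docs: Sequence[str]) -> str:
    """Build a plain RAG prompt (illustrative wording, not from the paper)."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


def context_map(question: str, docs: Sequence[str], llm: LLM, size: int) -> List[str]:
    """Map step: answer the question once per partition, in parallel."""
    parts = partition(docs, size)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the partitions.
        return list(pool.map(lambda part: llm(rag_prompt(question, part)), parts))


def context_reduce(question: str, partials: Sequence[str], llm: LLM) -> str:
    """Reduce step: summarize the per-partition responses into one answer."""
    candidates = "\n".join(f"- {p}" for p in partials)
    prompt = (
        f"Candidate answers:\n{candidates}\n\n"
        f"Question: {question}\n"
        "Summarize the candidates into one final answer."
    )
    return llm(prompt)


def brief_context(
    question: str,
    docs: Sequence[str],
    llm: LLM,
    needs_map_reduce: Preflight,
    partition_size: int = 2,
) -> str:
    """Run plain RAG when the preflight stand-in allows it, else map-reduce."""
    if not needs_map_reduce(question, docs):
        return llm(rag_prompt(question, docs))
    partials = context_map(question, docs, llm, partition_size)
    return context_reduce(question, partials, llm)
```

In this sketch, three retrieved documents with `partition_size=2` produce two short-context calls in the map step and one summarizing call in the reduce step, so each call sees at most two documents. When the preflight stand-in returns `False`, only a single long-context call is made, which is how the extra cost of the map step is avoided.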

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: Workflow of BriefContext. In the Context Map operation (1), the retrieved documents are divided into multiple partitions to create multiple RAG subtasks. In the Context Reduce operation (2), the responses from the previous step are collected and summarized into a final response.
  • Figure 2: Relationship between QA accuracy and positions of key information in the LLM context: (a-b) GPT-3.5-Turbo, (c-d) Mixtral-7x8b. The quartiles refer to the positions where the key document is located. Significance levels: * - $p < 0.05$; ** - $p < 0.01$; *** - $p < 0.001$; **** - $p < 0.0001$; ns - Not significant.
  • Figure 3: Integration testing of BriefContext with different LLM backbones: (a) Llama3-70B-instruct, (b) Llama2-70B-chat, (c) Mixtral-7x8b, and (d) GPT-3.5-turbo-0125. BC - BriefContext. RAG - Retrieval-augmented generation. CoT - Chain-of-Thought. Significance levels: * - $p < 0.05$; ** - $p < 0.01$; *** - $p < 0.001$; **** - $p < 0.0001$; ns - Not significant.
  • Figure 4: Number of cases (red) in which conflicting information was provided to the LLMs and number of correctly resolved cases (green).
  • Figure 5: Medical QA accuracy of LLMs with various numbers of documents as context information. The top solid line shows the performance in the Oracle settings. The bottom dotted line shows the performance of CoT. With the same key document in the context, the accuracy decreases as the number of documents increases. (a) Llama3-70B-instruct, (b) Llama2-70B-chat, (c) Mixtral-7x8b, and (d) GPT-3.5-turbo-0125. BC - BriefContext. RAG - Retrieval-augmented generation. CoT - Chain-of-Thought. Significance levels: * - $p < 0.05$; ** - $p < 0.01$; *** - $p < 0.001$; **** - $p < 0.0001$; ns - Not significant.
  • ...and 1 more figure