Query pipeline optimization for cancer patient question answering systems

Maolin He, Rena Gao, Mike Conway, Brian E. Chapman

TL;DR

This paper addresses hallucination in large language models for cancer patient question answering (CPQA) through retrieval-augmented generation (RAG). It optimizes three aspects of the RAG query pipeline: document retrieval with Hybrid Semantic Real-time Document Retrieval (HSRDR), two-stage passage retrieval that pairs domain-specific embedding models with rerankers, and semantic representation with the Semantic Enhanced Overlap Segmentation (SEOS) method. The authors also build a cancer-focused evaluation dataset (CMMQA). On a custom cancer QA dataset with Claude-3-haiku, the optimized pipeline improves answer accuracy by 5.24% over chain-of-thought prompting and by about 3% over a naive RAG setup, which highlights the value of domain-specific query optimization. The resulting framework combines real-time search, semantic retrieval, and adaptive text segmentation to support more accurate and reliable CPQA systems.

Abstract

Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.
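
To make the third aspect more concrete, the following is a minimal sketch of plain overlapping segmentation, the general idea that an overlap-based segmenter builds on. It is not the SEOS method itself, whose details are not given in the abstract. The function name `overlap_segments` and the parameters `window` and `overlap` are hypothetical, and the input is assumed to be already split into sentences.

```python
from typing import List


def overlap_segments(sentences: List[str], window: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into passages of `window` sentences.

    Consecutive passages share `overlap` sentences, so context that
    straddles a passage boundary appears in both neighbours.
    NOTE: a generic sliding-window sketch, not the paper's SEOS algorithm.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # how far the window advances each time
    segments: List[str] = []
    for start in range(0, len(sentences), step):
        segments.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # this window already reaches the end of the text
    return segments


if __name__ == "__main__":
    # Toy sentences, invented purely for illustration.
    toy = ["s1.", "s2.", "s3.", "s4.", "s5."]
    for passage in overlap_segments(toy, window=3, overlap=1):
        print(passage)
```

In a RAG pipeline, passages produced this way would be what the dense retriever embeds and the reranker scores. A larger overlap preserves more cross-boundary context at the cost of more passages to index.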

Paper Structure

This paper contains 19 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Description of filtered cancer QA datasets used in this study.
  • Figure 2: The HSRDR employs dual retrieval strategies, then downloads and filters candidate documents. After document retrieval, the next steps and comparative analyses are conducted.
  • Figure 3: Distribution comparison between the Initial Document Pool and the Top-5 Retrieved Evidence when HSRDR's retrieval source involves PubMed Abstract, PMC Reviews, and PMC Others
  • Figure 4: Performance of Embedding Models with rerankers
  • Figure :