BioRAG: A RAG-LLM Framework for Biological Question Reasoning

Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, Yuanchun Zhou

TL;DR

BioRAG proposes a retrieval-augmented LLM framework tailored for biological question reasoning, integrating a large PubMed-based internal corpus, external knowledge hubs, and a hierarchical knowledge structure to guide retrieval and tool use. It combines internal and external information sources with self-evaluated retrieval and customized prompts to generate evidence-based answers. Evaluations on six biology QA benchmarks show BioRAG outperforms fine-tuned LLMs, LLMs with search, and SciRAG-style baselines, with ablation confirming the importance of the Gene database, self-evaluation, and larger base models. The approach demonstrates strong potential for up-to-date, domain-aware reasoning in life sciences.

Abstract

Question answering for life science research, a field characterized by the rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and retrieving information accurately. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) framework built on Large Language Models (LLMs). Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships between each query and its context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process integrated with a search engine for step-by-step reasoning. Rigorous experiments demonstrate that our model outperforms fine-tuned LLMs, LLMs with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.
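
The hierarchy-enhanced vector retrieval mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the `Chunk` class, the `retrieve` function, the tag names, and the three-dimensional toy embeddings are all hypothetical, and a real system would use the trained embedding model and a proper vector index. The idea shown is simply to narrow candidates by a topic tag from the knowledge hierarchy before ranking them by cosine similarity.

```python
import math
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Chunk:
    text: str
    topic_tag: str          # hypothetical label taken from a knowledge hierarchy
    embedding: List[float]  # toy vector standing in for a learned embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: List[float], query_tags: Set[str],
             chunks: List[Chunk], top_k: int = 2) -> List[Chunk]:
    """Keep chunks whose tag matches the query's tags, then rank by similarity."""
    candidates = [c for c in chunks if c.topic_tag in query_tags]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_embedding, c.embedding),
                    reverse=True)
    return ranked[:top_k]


# Toy corpus; texts, tags, and vectors are illustrative only.
corpus = [
    Chunk("BRCA1 is involved in DNA repair.", "Genes", [0.9, 0.1, 0.0]),
    Chunk("Mitochondria produce ATP.", "Cell Biology", [0.1, 0.9, 0.0]),
    Chunk("TP53 encodes a tumor suppressor.", "Genes", [0.8, 0.2, 0.1]),
]
hits = retrieve([1.0, 0.0, 0.0], {"Genes"}, corpus)
```

Filtering before ranking keeps off-topic chunks out of the candidate set even when their embeddings happen to lie close to the query.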

Paper Structure

This paper contains 15 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: An illustration of the difference between three paradigms: (a) a fine-tuned language model embeds domain knowledge into deep space; (b) a RAG-based method retrieves supplementary information from a constructed knowledge base; (c) BioRAG adaptively selects knowledge sources and domain-specific tools to advance the biology question-reasoning task.
  • Figure 2: The architecture of our proposed BioRAG framework. The pipeline consists of five iterative components designed to enhance the process of biological question-reasoning: ① Retriever Selection chooses the most suitable information source; ② Query Pre-processing rewrites the query and finds the closest topic tag in the pre-defined knowledge hierarchy; ③ Retriever Execution performs combined retrieval of the correlated context from the knowledge base; ④ Self-Evaluation assesses the adequacy of the retrieved information and decides whether to cycle through additional retrieval tools or to move to the next phase; ⑤ Inference and Generation uses the gathered information to generate an informed and accurate answer to the biological query.
  • Figure 3: Training Template for $\mathcal{M}_{\text{MeSH}}$.
  • Figure 4: An example of MeSH filtering SQL generation.
  • Figure 5: A case study selected from the College Biology dataset.
  • ...and 2 more figures
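
The five components named in the Figure 2 caption can be read as a retrieve-and-check loop. The sketch below is a hedged illustration of that control flow only: `answer_question`, its callback parameters, the retriever names, and the toy data are all assumptions made here, not part of the paper, and the query pre-processing step is reduced to a single rewrite callback (the topic-tag lookup is omitted).

```python
from typing import Callable, Dict, List, Optional


def answer_question(
    question: str,
    retrievers: Dict[str, Callable[[str], List[str]]],
    select_retriever: Callable[[str, List[str]], Optional[str]],
    rewrite_query: Callable[[str], str],
    is_sufficient: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Loop over retrieval sources until the evidence is judged sufficient."""
    context: List[str] = []
    tried: List[str] = []
    for _ in range(max_rounds):
        name = select_retriever(question, tried)   # 1. Retriever Selection
        if name is None:                           # no untried source is left
            break
        query = rewrite_query(question)            # 2. Query Pre-processing
        context.extend(retrievers[name](query))    # 3. Retriever Execution
        tried.append(name)
        if is_sufficient(question, context):       # 4. Self-Evaluation
            break
    return generate(question, context)             # 5. Inference and Generation


# Toy stand-ins for the real components (hypothetical names and data).
retrievers = {
    "local_corpus": lambda q: [],
    "gene_database": lambda q: ["TP53 is located on chromosome 17."],
}
order = ["local_corpus", "gene_database"]

answer = answer_question(
    "Which chromosome carries TP53?",
    retrievers,
    select_retriever=lambda q, tried: next((n for n in order if n not in tried), None),
    rewrite_query=lambda q: q.strip().lower(),
    is_sufficient=lambda q, ctx: len(ctx) > 0,
    generate=lambda q, ctx: " ".join(ctx) or "no evidence found",
)
```

In this toy run the first source returns nothing, so the self-evaluation step sends the loop on to the second source before generation. In the described framework, the selection, evaluation, and generation steps would presumably be LLM-driven rather than simple lambdas.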