MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation

Seonok Kim

TL;DR

MedBioLM addresses the challenge of reliable biomedical QA by combining domain-specific fine-tuning with retrieval-augmented generation (RAG) and task-aware prompting. The approach yields strong closed-ended performance (e.g., MedQA 88%, BioASQ 96%), meaningful long-form gains (ROUGE/BLEU), and robust short-form results; fine-tuning accounts for most of the improvement, while RAG provides targeted factual reinforcement. Across diverse datasets, the work demonstrates the value of domain adaptation for medical reasoning and supports its application to clinical decision support and biomedical research tools. The findings also reveal limitations, such as BLEURT bottlenecks and variable RAG impact, which point future work toward improved evaluation, retrieval strategies, and human-in-the-loop refinement.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, their application to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to improve answers to both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning abilities and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.

Paper Structure

This paper contains 14 sections, 2 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Comparative performance of MedBioLM and base models on closed-ended and short-form biomedical QA tasks, highlighting the benefits of fine-tuning.
  • Figure 2: Overview of our approach for optimizing large language models (LLMs) in biomedical question answering, integrating fine-tuning, retrieval-augmented generation (RAG), and prompt engineering to enhance performance across different QA formats.
  • Figure 3: Illustration of the Retrieval-Augmented Generation (RAG) process. The system consists of three main components: (1) Query Encoder, which processes the input query into tokenized representations ($T_1, T_2, \dots, T_n$), (2) Knowledge Searching and Retrieving, where the system performs document cracking, chunking, and index projection to retrieve relevant knowledge ($K_1, K_2, \dots, K_n$), and (3) Answer Generator, which integrates retrieved data into the response generation process. Here, $T_i$ represents tokenized query input, while $K_i$ denotes retrieved knowledge chunks. This approach enhances factual accuracy by incorporating external knowledge into the model’s output.
  • Figure 4: Impact of increasing Top-K on MedQA short-form QA. As the number of retrieved documents increases, scores on all evaluation metrics decrease. Because the task expects concise short-form answers, retrieving more documents introduces noise and conflicting information, which degrades answer quality.
  • Figure 5: Pairwise evaluation of long-form answers comparing GPT-4o and MedBioLM across five key criteria: overall quality, coherence, succinctness, coverage, and accuracy. Bars represent the percentage of responses where GPT-4o was preferred (blue), MedBioLM was preferred (purple), or the responses were rated as tied (gray).
  • ...and 1 more figure
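
The retrieval flow described in the Figure 3 caption (tokenize the query, chunk documents, retrieve knowledge chunks $K_1, \dots, K_n$, pass them to the generator) and the Top-K effect described for Figure 4 can be illustrated with a toy sketch. This is not the paper's implementation: the bag-of-words cosine scoring, the three-sentence corpus, and the function names `tokenize`, `chunk`, `retrieve`, and `build_prompt` are all assumptions made for illustration, and no language model is called.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for the query encoder's tokens T_i."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(document, size=12):
    """Split a document into fixed-size word chunks (the chunking step)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k):
    """Return the top_k chunks ranked by similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def build_prompt(query, retrieved):
    """Assemble the generator input from the retrieved chunks and the question."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer concisely:"


# Hypothetical three-document corpus, used only to exercise the functions.
CORPUS = [
    "Metformin is a first-line oral medication for type 2 diabetes.",
    "Insulin is a hormone produced by beta cells of the pancreas.",
    "Aspirin inhibits cyclooxygenase and reduces platelet aggregation.",
]
CHUNKS = [piece for doc in CORPUS for piece in chunk(doc)]
QUERY = "Which hormone is produced by pancreatic beta cells?"

if __name__ == "__main__":
    for k in (1, 3):
        print(f"--- top_k={k} ---")
        print(build_prompt(QUERY, retrieve(QUERY, CHUNKS, k)))
```

With `top_k=1` the prompt contains only the insulin chunk; with `top_k=3` the unrelated metformin and aspirin chunks are also included. In this toy setting, that is the kind of extra context the Figure 4 caption describes as noise for short-form answers.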