Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs

Chenqian Le, Ziheng Gong, Chihang Wang, Haowei Ni, Panfeng Li, Xupeng Chen

TL;DR

This work evaluates how prompt design (standard vs. Chain-of-Thought) and lightweight instruction fine-tuning via 4-bit QLoRA influence biomedical question answering with open-source LLMs on PubMedQA. It systematically compares base and instruction-tuned models across four architectures. CoT prompts improve zero-shot reasoning and instruction tuning boosts accuracy, but fine-tuning on CoT prompts yields model- and scale-dependent results, and larger models may not benefit from CoT after fine-tuning. The findings underscore the need for careful alignment between reasoning prompts and model tuning. Practically, the study offers guidance for combining prompt engineering with efficient fine-tuning to advance medical QA applications while managing resource constraints.

Abstract

Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies, standard instruction prompts and Chain-of-Thought (CoT) prompts, and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient fine-tuning for medical QA applications.
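
To make the contrast between the two prompting strategies concrete, here is a minimal sketch of how a standard prompt and a CoT prompt might be built for a PubMedQA-style item, and how a final label could be read back from the generated text. The prompt wording, the function names `build_prompt` and `parse_answer`, and the `Answer:` marker are illustrative assumptions, not the templates used in the paper; the yes/no/maybe label set follows PubMedQA.

```python
from typing import Optional

# PubMedQA answer options.
LABELS = ("yes", "no", "maybe")


def build_prompt(question: str, context: str, use_cot: bool = False) -> str:
    """Build a standard or Chain-of-Thought prompt (hypothetical wording)."""
    header = (
        "Answer the biomedical question using the context.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
    )
    if use_cot:
        # CoT variant: ask for reasoning first, then a marked final answer.
        return header + (
            "Let's think step by step, then finish with "
            "'Answer: yes', 'Answer: no', or 'Answer: maybe'.\n"
        )
    # Standard variant: ask for the label directly.
    return header + "Answer with one word: yes, no, or maybe.\nAnswer:"


def parse_answer(generation: str) -> Optional[str]:
    """Return the first label found after the last 'Answer:' marker.

    If no marker is present, the whole text is scanned instead.
    Returns None when no label is found.
    """
    tail = generation.lower().rsplit("answer:", 1)[-1]
    for word in tail.split():
        cleaned = word.strip(".,!?'\"")
        if cleaned in LABELS:
            return cleaned
    return None
```

The same parser can score both prompt styles, since the CoT variant is asked to end with the marked answer; scanning only the text after the last marker keeps words such as "no" inside the reasoning from being mistaken for the final label.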

Paper Structure

This paper contains 20 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the framework for medical question answering using LLMs with standard vs. Chain-of-Thought prompting.
  • Figure 2: Model performance (Accuracy and F1) across different settings: Default vs. CoT prompt; Base vs. Fine-tuned.