Structured Outputs Enable General-Purpose LLMs to be Medical Experts

Guangfu Guo, Kai Zhang, Bryan Hoo, Yujun Cai, Xiaoqian Lu, Nanyun Peng, Yiwei Wang

TL;DR

This paper tackles factuality and comprehensiveness in open-ended medical QA by proposing Med-SoCoT, a training-free structured output prompting framework that enforces a seven-step medical reasoning process via templates and stepwise generation. On MedLFQA, it achieves a peak Factuality Score of 85.8, outperforming fine-tuned baselines and transferring gains to smaller models. Ablation studies validate the necessity of each step and optimization components, with improvements reflected in Words Composition and reduced hallucinations. The approach offers a scalable alternative to domain-specific fine-tuning and shows promise for extending to other domains requiring precise, interpretable long-form reasoning.

Abstract

Medical question-answering (QA) is a critical task for evaluating how effectively large language models (LLMs) encode clinical knowledge and for assessing their potential applications in medicine. Despite showing promise on multiple-choice tests, LLMs frequently struggle with open-ended medical questions, producing responses that contain dangerous hallucinations or lack comprehensive coverage of critical aspects. Existing approaches attempt to address these challenges through domain-specific fine-tuning, but this proves resource-intensive and difficult to scale across models. To improve the comprehensiveness and factuality of medical responses, we propose a novel approach utilizing structured medical reasoning. Our method guides LLMs through a seven-step cognitive process inspired by clinical diagnosis, enabling more accurate and complete answers without additional training. Experiments on the MedLFQA benchmark demonstrate that our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models. Notably, this improvement transfers to smaller models, highlighting the method's efficiency and scalability. Our code and datasets are available.
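
The stepwise, template-driven generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the seven step names are placeholders (this summary does not list the actual steps), and `generate` is a hypothetical callable standing in for whatever LLM API is used. The sketch only shows the general pattern of prompting one step at a time and feeding earlier step outputs into later prompts.

```python
from typing import Callable, Dict, List

# Placeholder names only: the real seven steps are not listed in this summary.
STEP_NAMES: List[str] = [f"step_{i}" for i in range(1, 8)]


def build_step_prompt(question: str, step_name: str, prior: Dict[str, str]) -> str:
    """Render a template for one step, including outputs of earlier steps."""
    context = "\n".join(f"[{name}] {text}" for name, text in prior.items())
    return (
        f"Question: {question}\n"
        f"Previous steps:\n{context or '(none)'}\n"
        f"Now complete only this step: {step_name}\n"
    )


def stepwise_answer(
    question: str,
    generate: Callable[[str], str],
    step_names: List[str] = STEP_NAMES,
) -> Dict[str, str]:
    """Run the steps in order; each prompt sees all earlier step outputs."""
    outputs: Dict[str, str] = {}
    for name in step_names:
        prompt = build_step_prompt(question, name, outputs)
        outputs[name] = generate(prompt).strip()
    return outputs


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): echoes the requested step."""
    last_line = prompt.strip().splitlines()[-1]
    return f"draft for {last_line.split(': ')[-1]}"


if __name__ == "__main__":
    result = stepwise_answer("What are common causes of chest pain?", echo_model)
    for name, text in result.items():
        print(name, "->", text)
```

Because no model weights are touched, the same loop can in principle be pointed at any instruction-following model by swapping the `generate` callable, which is the sense in which such a prompting scheme is training-free.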

Paper Structure

This paper contains 37 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: A flowchart showing the doctor’s cognitive process to answer a patient’s question, involving medical analysis, relevant information, and follow-up steps.
  • Figure 2: Factuality Scores for different models (LLaMA2-7B, Meditron-7B, Mistral-7B, BioMistral-7B) across three methods: Zero-shot, OLAPH, and Med-SoCoT (Ours).
  • Figure 3: Factuality Scores for different models (Gemma-7B, LLaMA3.1-3B-INSTRUCT, GPT-3.5-Turbo) across three methods: Zero-shot, CoT and Med-SoCoT (Ours).
  • Figure 4: Model: Gemma2-7B. LLMs often generate unreliable answers due to cognitive limitations. Structured output helps LLMs analyze problems step by step, leading to more complete and accurate answers.
  • Figure 5: Model: Gemma-7B, Dataset: LiveQA. An overview of the structured medical reasoning process, a step-by-step framework for improving the comprehensiveness and factuality of medical QA.
  • ...and 5 more figures