To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Zaifu Zhan; Min Zeng; Shuang Zhou; Yiran Song; Xiaoyi Chen; Yu Hou; Yifan Wu; Yang Ruan; Rui Zhang

TL;DR

Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed, provides a simple, model-agnostic, and cost-effective approach for medical QA.

Abstract

Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss ($\leq 4\%$). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
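
The decide-then-answer control flow described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `selective_cot` function, and the `generate` callable (standing in for any LLM text-generation call) are all hypothetical, and the two-call structure is an assumption, since the decision and the answer could equally be elicited by a single prompt.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
DECIDE_PROMPT = (
    "Does answering the following question require step-by-step reasoning? "
    "Reply with YES or NO only.\n\n{question}"
)
DIRECT_PROMPT = "Answer with the option letter only.\n\n{question}"
COT_PROMPT = "Think step by step, then give the option letter.\n\n{question}"


def selective_cot(question: str, generate: Callable[[str], str]) -> dict:
    """Route a question to direct answering or CoT based on a first-pass decision.

    `generate` is an assumed interface: it takes a prompt string and returns
    the model's text output.
    """
    # Step 1: ask the model whether explicit reasoning is needed.
    decision = generate(DECIDE_PROMPT.format(question=question))
    needs_reasoning = decision.strip().upper().startswith("YES")

    # Step 2: produce a rationale only when step 1 says it is needed.
    template = COT_PROMPT if needs_reasoning else DIRECT_PROMPT
    output = generate(template.format(question=question))
    return {"used_cot": needs_reasoning, "output": output}
```

Under this sketch, recall-type questions skip the rationale entirely, which is where the token and latency savings would come from. The cost of the extra decision call is small if the reply is limited to a single word.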

Paper Structure (20 sections, 3 figures, 2 tables)

Introduction
Methods
Overview of methods
Task and datasets
Selective Chain-of-thought
Prompt
Models
Metrics
Experiments
Results
Main results
Ablation Study: Reasoning Length vs. Selective CoT
Discussion
Conclusion
Data Availability
...and 5 more sections

Figures (3)

Figure 1: Overview of Standard Prompt, Chain-of-thought, and Selective CoT prompting (ours) on a representative MedQA case. Left: an example clinical vignette (multiple-choice from MedQA-USMLE dataset). Middle: three inference paradigms—Standard Prompt (answer directly), CoT Prompt (produce step-by-step reasoning before the answer), and Selective CoT Prompt (first decide whether reasoning is necessary; if yes, generate CoT, otherwise answer directly). Right: a comparison illustrating that full CoT improves logicality and interpretability when reasoning is required, but wastes compute on recall-type questions; Selective CoT preserves accuracy while improving efficiency and cost-effectiveness by invoking reasoning only when needed.
Figure 2: Performance and efficiency comparison of Selective CoT versus fixed-length CoT across four biomedical QA datasets (HeadQA, MedMCQA, MedQA-USMLE, PubMedQA) and two open-source LLMs (Llama-3.1-8B, Qwen2.5-7B). The three rows report, respectively, Accuracy, #Tokens, and Inference Time. Selective CoT matches or slightly outperforms strong fixed-length CoT baselines while substantially reducing token usage and latency, yielding a superior compute-performance trade-off.
Figure 3: Ablation on reasoning length versus Selective CoT. For fixed-length CoT, we sweep reasoning lengths (e.g., 100-600 words) and fit a dashed quadratic curve to the length-accuracy relationship; Selective CoT is marked as a red point for reference. Accuracy under Selective CoT typically lies near the empirical optimum and often on or above the fitted curve for several datasets, achieving comparable or higher accuracy with fewer tokens and shorter time than long, uniform CoT.
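
The quadratic fit mentioned in the Figure 3 caption can be illustrated with a small least-squares sketch. It assumes an ordinary least-squares fit of accuracy against reasoning length; the paper's fitting procedure is not specified here. The function names are hypothetical, and the numbers in the usage note are synthetic placeholders, not results from the paper.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Power sums: s[k] = sum of x^k, t[k] = sum of y * x^k.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def curve_peak(a, b):
    """x-position of the vertex of a*x^2 + b*x + c (a maximum when a < 0)."""
    return -b / (2 * a)
```

For example, with lengths expressed in hundreds of words (which keeps the normal equations well conditioned) and synthetic accuracies generated from a known parabola, `fit_quadratic([1, 2, 3, 4, 5, 6], [66, 70, 72, 72, 70, 66])` recovers approximately `(-1, 7, 60)`, and `curve_peak` places the optimum near 3.5, i.e. about 350 words. A Selective CoT accuracy can then be compared against the fitted curve's value at the same token budget.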
