On the Robustness of Answer Formats in Medical Reasoning Models

Pittawat Taveekitworachai, Natpatchara Pongjirapat, Krittaphas Chaisutyakorn, Piyalitt Ittichaiwong, Tossaporn Saengja, Kunat Pipatanakul

TL;DR

Supervised fine-tuning yields more stable behavior across answer formats, whereas reinforcement fine-tuning is often more brittle across formats, with the degree of instability depending strongly on reward design.

Abstract

Medical reasoning models (MRMs) achieve superior performance on medical benchmarks compared to medical LLMs; however, high accuracy alone is insufficient for practical deployment. One requirement for real-world application is robustness to varying output constraints: posing the same medical question while requesting different answer formats should not affect the underlying correctness of the response. We investigate this phenomenon, focusing on MRMs. To quantify it, we propose a metric, answer-format robustness: the ability to reliably generate correct outputs across varying specified formats. We examine three representative formats: multiple-choice, open-ended question-answering, and ranked lists. Across 15 proprietary and open-weight models, we observe substantial variation in format robustness (35-100%). Furthermore, we conduct controlled fine-tuning experiments on a shared backbone with matched training data to isolate the effects of the fine-tuning paradigm. We find that supervised fine-tuning yields more stable behavior across formats, whereas reinforcement fine-tuning often exhibits higher cross-format brittleness, with the degree of instability strongly dependent on reward design. Overall, answer-format robustness in MRMs is trainable yet brittle and requires careful evaluation for practical medical use.

Paper Structure

This paper contains 118 sections, 2 equations, 47 figures, 36 tables.

Figures (47)

  • Figure 1: Overview of our study on answer-format robustness in MRMs. Left: Examples of MCQ, QA, and ranked-list formats using the same medical question. Right: Investigation pipeline, comprising observational prompting analysis and controlled fine-tuning experiments, along with the evaluation metrics used.
  • Figure 2: Answer-format robustness varies from 35.91% (HuatuoGPT-o1) to 99.78% (Qwen3 4B), with MRMs showing lower average robustness than general LLMs.
  • Figure 3: (a) Comparison of MCQ and List accuracies shows that most models achieve higher List accuracy, indicating format-dependent differences in knowledge access. (b) Per-question correctness transitions reveal strong asymmetry, with substantial degradation from MCQ to QA and high stability from QA to List@1.
  • Figure 4: CoT tends to degrade performance for most models. Some reasoning models (OpenThinker3, HuatuoGPT-o1, m1) show degradation, while LLMs like Qwen2.5 7B benefit from CoT.
  • Figure 5: Robustness and performance for the backbone model (Baseline) and models fine-tuned (SFT, RFT) on specific answer formats, evaluated across MCQ, QA, and List. Each panel shows one training-format target. (a) Robustness: MCQ-trained models generalize well across formats for both SFT and RFT. (b) Performance: Fine-tuned models perform best on their training format; robustness failures in RFT lead to sharp accuracy drops on unseen formats. Note that RFT-List uses a standard accuracy-based reward; alternative list-specific rewards are analyzed in the finding on reward design.
  • ...and 42 more figures
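
The per-question correctness transitions mentioned in the Figure 3(b) caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `correctness_transitions`, the dictionary layout, and the toy data are assumptions made for illustration, and the paper's own transition analysis may be defined differently.

```python
from collections import Counter


def correctness_transitions(source, target):
    """Count per-question correctness transitions between two answer formats.

    `source` and `target` map a question id to True (correct) or False
    (incorrect) under each format. Only questions present in both are counted.
    This helper is hypothetical, not taken from the paper.
    """
    shared = source.keys() & target.keys()
    return Counter((source[q], target[q]) for q in shared)


# Toy correctness labels (illustrative only, not results from the paper).
mcq = {"q1": True, "q2": True, "q3": False, "q4": True}
qa = {"q1": True, "q2": False, "q3": False, "q4": False}

transitions = correctness_transitions(mcq, qa)
degraded = transitions[(True, False)]   # correct in MCQ, incorrect in QA
improved = transitions[(False, True)]   # incorrect in MCQ, correct in QA
print(f"degraded={degraded}, improved={improved}")
```

A large gap between the `(True, False)` and `(False, True)` counts corresponds to the kind of asymmetry the caption describes, where moving from one format to another degrades far more answers than it recovers.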