Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

Yizhou Liu, Dingkang Yang, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Jingwei Wei, Lihua Zhang

TL;DR

This paper introduces Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA. ARMed substantially improves both accuracy and generalization, highlighting the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.

Abstract

Reinforcement learning (RL) with rule-based reward functions has recently shown great promise in enhancing the reasoning depth and generalization ability of vision-language models (VLMs) while maintaining computational efficiency. Despite these advances, its adoption in medical imaging remains limited. Current reinforcement fine-tuning (RFT) efforts in this field mainly focus on closed-ended visual question answering (VQA), which restricts their applicability to realistic clinical reasoning. Open-ended medical VQA better mirrors clinical diagnostic workflows, yet it remains underexplored. Although several studies have attempted to bridge the two formats through semantically guided RL, model-driven semantic rewards often suffer from reward collapse, where responses with distinct semantics yield nearly identical scores. To overcome this limitation, we introduce Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA. ARMed first injects domain expertise through supervised fine-tuning (SFT) on chain-of-thought annotations, and then applies reinforcement optimization with textual correctness and adaptive semantic rewards to refine reasoning consistency and factual accuracy. Extensive experiments on six challenging medical VQA benchmarks demonstrate that ARMed substantially improves both accuracy and generalization. These findings underscore the importance of reward discriminability in medical RL and highlight the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.
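
To make the reward collapse described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard GRPO group-relative advantage, (r - mean) / std, computed over one group of sampled answers. The reward values, the function names `group_advantages` and `reward_spread`, and the `eps` constant are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward_spread(rewards):
    """Range of rewards inside one group; a crude discriminability measure."""
    return max(rewards) - min(rewards)


# Hypothetical semantic scores for four sampled answers to one question.
# The first answer is assumed correct; the other three are assumed wrong
# but on-topic, so a continuous similarity model scores them almost the same.
collapsed = [0.86, 0.84, 0.85, 0.83]

# Hypothetical scores from a more discriminative reward for the same answers.
discriminative = [0.95, 0.20, 0.35, 0.10]

if __name__ == "__main__":
    for name, rewards in [("collapsed", collapsed),
                          ("discriminative", discriminative)]:
        advantages = [round(a, 2) for a in group_advantages(rewards)]
        print(name, "spread:", round(reward_spread(rewards), 2),
              "advantages:", advantages)
```

In the collapsed group the raw scores differ by only a few hundredths, yet the normalized advantages are still of order one. Differences that may be little more than scoring noise therefore drive the policy update as strongly as the clear gap in the discriminative group, which is one way to see why reward discriminability matters.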

Paper Structure

This paper contains 32 sections, 23 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Illustration of the reward collapse issue caused by continuous semantic rewards and comparison among different GRPO variants. The upper part shows the vanilla GRPO trained in closed-ended settings, while real-world tasks are often open-ended. The lower part depicts how semantic-based GRPO can suffer from reward collapse due to the continuity of semantic rewards. Our proposed ARMed method adaptively mitigates this problem, aligning learning between open- and closed-ended scenarios.
  • Figure 2: Overview of the ARMed framework in a medical QA task. (a) illustrates the GRPO training process; (b) shows how ARMed’s adaptive semantic reward mitigates reward collapse; and (c) outlines the three-stage training pipeline: Reward-driven Pretraining, Knowledge-enhanced Fine-tuning, and Reward-based Refinement.
  • Figure 3: Example of ARMed in a medical QA task. This illustrates why textual similarity alone is insufficient, how naive semantic rewards can collapse, and how ARMed enables stable and meaningful semantic evaluation for reliable medical reasoning.
  • Figure 4: Distribution of model semantic rewards during training. It compares the distributions of non-discriminative (GRPO) and adaptive (ARMed-I) rewards.
  • Figure 5: Ablation study results of different model variants on six medical benchmarks. “Avg.” denotes open-ended QA datasets (average of multiple metrics), and “acc.” represents closed-ended QA datasets (accuracy). The radar chart illustrates the contribution of text, semantic, and adaptive modules, as well as data augmentation strategies, to overall performance.
  • ...and 9 more figures