Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

Yizhou Liu; Dingkang Yang; Zizhi Chen; Minghao Han; Xukun Zhang; Keliang Liu; Jingwei Wei; Lihua Zhang

Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

Yizhou Liu, Dingkang Yang, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Jingwei Wei, Lihua Zhang

TL;DR

Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA, which substantially improves both accuracy and generalization and highlights the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.

Abstract

Reinforcement learning (RL) with rule-based reward functions has recently shown great promise in enhancing the reasoning depth and generalization ability of vision-language models (VLMs), while maintaining computational efficiency. In spite of these advances, its adoption in medical imaging remains limited. Current reinforcement fine-tuning (RFT) efforts in this field mainly focus on closed-ended visual question answering (VQA), restricting their applicability to realistic clinical reasoning. However, open-ended medical VQA better mirrors clinical diagnostic workflows but remains underexplored. Although several studies have attempted to bridge the two formats through semantically guided RL, model-driven semantic rewards often suffer from reward collapse, where responses with distinct semantics yield nearly identical scores. To overcome this limitation, we introduce Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA. ARMed first injects domain expertise through supervised fine-tuning (SFT) on chain-of-thought annotations, followed by reinforcement optimization using textual correctness and adaptive semantic rewards to refine reasoning consistency and factual accuracy. Extensive experiments on six challenging medical VQA benchmarks demonstrate that ARMed substantially improves both accuracy and generalization. These findings underscore the importance of reward discriminability in medical RL and highlight the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.

Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

TL;DR

Abstract

Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (14)