Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks

Jindong Hong; Tianjie Chen; Lingjie Luo; Chuanyang Zheng; Ting Xu; Haibao Yu; Jianing Qiu; Qianzhong Chen; Suning Huang; Yan Xu; Yong Gui; Yijun He; Jiankai Sun

TL;DR

This study investigates whether an explicit thinking mode in dual-state multimodal LLMs improves clinical task performance. By evaluating Seed1.5-VL and Gemini-2.5-Flash on four medical visual tasks using VQA-RAD and ROCOv2, the authors quantify changes in accuracy, consistency, and latency when thinking is enabled versus disabled. The results show only modest gains from thinking mode, with larger benefits on more complex tasks. Even so, performance on highly complex medical tasks remains suboptimal, and output consistency often declines. The work highlights the need for domain-specific medical data and better integration of medical knowledge to realize the potential of reasoning-enabled MLLMs in clinical settings.

Abstract

A recent advancement in Multimodal Large Language Model (MLLM) research is the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking process (commonly referred to as the "thinking mode") alongside the standard "non-thinking mode". This capability allows these models to engage in step-by-step internal deliberation before generating a final response. With the rapid adoption of these "dual-state" MLLMs, this work rigorously evaluates how their enhanced reasoning processes affect model performance and reliability in clinical tasks. Specifically, we evaluate the thinking mode of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications, assessing their performance on four visual medical tasks using the VQA-RAD and ROCOv2 datasets. Our findings reveal that, for the majority of the tasks, activating the thinking mode yields only marginal improvement over the standard non-thinking mode. Performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.
