MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

Wenjie Li, Yujie Zhang, Haoran Sun, Xingqi He, Hongcheng Gao, Chenglong Ma, Ming Hu, Guankun Wang, Shiyi Yao, Renhao Yang, Hongliang Ren, Lei Wang, Junjun He, Yankai Jiang

TL;DR

MedScope introduces a tool-using framework for long-form clinical video reasoning that performs coarse-to-fine evidence seeking and verification via a structured trajectory of reasoning and tool interactions. It is supported by ClinVideoSuite, an evidence-centric data pipeline for dense, temporally localized supervision, and GA-GRPO, a grounding-aware reinforcement learning objective that reinforces temporally aligned tool use and evidence fidelity. Across SVU-31K and ClinVideo-Eval, MedScope achieves state-of-the-art results on multi-grained video understanding, fine-grained reasoning, and grounded VQA, with strong generalization to out-of-domain data. The work advances toward reliable, verifiable medical AI agents that genuinely "think with videos" through integrated reasoning and evidence-grounded tool use.

Abstract

Long-form clinical videos are central to visual evidence-based decision-making and are increasingly important in applications such as surgical robotics. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach points toward medical AI agents that can genuinely "think with videos" through tool-integrated reasoning. We will release our code, models, and data.

Paper Structure

This paper contains 94 sections, 15 equations, 29 figures, 5 tables, 1 algorithm.

Figures (29)

  • Figure 1: Performance comparison on full and fine-grained video understanding and VQA benchmarks. Left: grounded VQA accuracy on ClinVideo-Eval. Right: full and fine-grained video understanding quality on SVU-31K.
  • Figure 2: Comparison between textual CoT and visual CoT for evidence-grounded clinical video reasoning. Left: textual CoT shows overconfident hallucinations (red), inventing rationales and predicting the wrong instrument. Right: visual CoT iteratively retrieves and integrates dense visual evidence via tool calls, grounding reasoning in localized observations and producing the correct answer.
  • Figure 3: Overview of MedScope. (a) Coarse-to-fine clinical reasoning with explicit thought and tool actions that progressively retrieve temporally targeted dense evidence for verification. (b) Three-stage training pipeline: clinical reasoning warm-up, visual-CoT cold-start SFT on ClinVideoSuite, and agentic tool reinforcement learning with grounding-aware rewards and advantage shaping.
  • Figure 4: ClinVideoSuite data synthesis pipeline. Stage 1 builds evidence-centric dense captions and global summaries. Stage 2 generates and filters QA with text checks and multimodal verification to enforce localized evidence dependence. Stage 3 collects tool-augmented visual CoT trajectories via native tool interaction in a real video environment.
  • Figure 5: Ablations on reward design. We compare Ours with three variants: w/o $R_{\text{evidence}}$ removes the evidence reward; conditional $R_{\text{evidence}}$ applies it only when $R_{\text{acc}}{=}1$; w/o IoU bonus removes the continuous IoU bonus in $R_{\text{evidence}}$.
  • ...and 24 more figures
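
The reward ablation in Figure 5 can be made concrete with a small sketch. The code below is an illustration only, not the paper's reward definition: the figure caption states only that $R_{\text{evidence}}$ contains a continuous IoU bonus and that one variant applies it only when $R_{\text{acc}}{=}1$. The function names, the hit threshold, the bonus weight, and the additive combination of the two rewards are all assumptions made here for illustration.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a retrieved video segment


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def evidence_reward(
    pred: Span,
    gt: Span,
    hit_threshold: float = 0.5,     # ASSUMPTION: not specified in the source
    iou_bonus_weight: float = 0.5,  # ASSUMPTION: not specified in the source
    use_iou_bonus: bool = True,     # False mimics the "w/o IoU bonus" variant
) -> float:
    """Hypothetical evidence reward: a binary hit plus a continuous IoU bonus."""
    iou = temporal_iou(pred, gt)
    reward = 1.0 if iou >= hit_threshold else 0.0
    if use_iou_bonus:
        reward += iou_bonus_weight * iou
    return reward


def total_reward(r_acc: float, r_evidence: float, variant: str = "full") -> float:
    """Combine rewards under the ablation variants named in Figure 5.

    ASSUMPTION: rewards are simply summed; the source does not give the formula.
    """
    if variant == "no_evidence":   # w/o R_evidence
        return r_acc
    if variant == "conditional":   # R_evidence counted only when R_acc = 1
        return r_acc + (r_evidence if r_acc == 1.0 else 0.0)
    return r_acc + r_evidence      # full variant


if __name__ == "__main__":
    r_ev = evidence_reward((12.0, 30.0), (15.0, 32.0))
    for variant in ("full", "conditional", "no_evidence"):
        print(variant, total_reward(0.0, r_ev, variant))
```

Under these assumptions, the full variant still rewards a rollout that retrieves the right segment but answers incorrectly, while the conditional variant gives it nothing. That difference is what the conditional ablation isolates.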