RVLM: Recursive Vision-Language Models with Adaptive Depth

Nicanor Mayumu, Zeenath Khan, Melodena Stephens, Patrick Mukala, Farhad Oroumchian

Abstract

Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets, wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying the auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: https://github.com/nican2018/rvlm.

Paper Structure (58 sections, 2 equations, 3 figures, 8 tables, 2 algorithms)

Figures (3)

  • Figure 1: RVLM system architecture. The core VLM $\mathcal{M}_V$ operates within a persistent Python REPL environment $\mathcal{E}_V$ that maintains state across iterations. The environment provides first-class image storage (context_images), vision-specific query primitives, and full programmatic image manipulation. At each iteration $t$, the model generates executable code, which is run in the REPL; its outputs are appended to the message history. The loop terminates when the model emits a FINAL or FINAL_VAR signal.
  • Figure 2: Auto-generated clinical PDF report for BraTS-MEN-00004-000. Five labelled sections, a mask-statistics table, and an AI disclaimer footer. Intended for clinical review, not autonomous diagnosis.
  • Figure 3: Auto-generated clinical PDF report for MIMIC-CXR subject 10000032. Five clinical sections, a Ground Truth Reference block for radiologist comparison, and an AI disclaimer. The model correctly identifies the AP portable projection and qualifies its cardiac size estimate accordingly.
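
The generate-execute loop described in the abstract and in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `run_loop` function, the `FINAL(...)` / `FINAL_VAR(name)` text syntax, the `patience` parameter, and the "no printed output" stall criterion are all assumptions made for illustration, and the real VLM is replaced by a scripted stand-in. Only the names `context_images`, FINAL, and FINAL_VAR come from the source.

```python
import contextlib
import io
import re
from typing import Callable, Dict, List, Optional

# Assumed signal syntax; the source only names the FINAL and FINAL_VAR signals.
FINAL_RE = re.compile(r"FINAL\((.*?)\)", re.S)
FINAL_VAR_RE = re.compile(r"FINAL_VAR\((\w+)\)")


def run_loop(
    model: Callable[[List[Dict[str, str]]], str],
    env: dict,
    max_iters: int,
    patience: int = 2,
) -> Optional[str]:
    """Run a generate-execute loop in a persistent namespace `env`.

    `max_iters` plays the role of the iteration budget. Stopping after
    `patience` consecutive steps with no printed output is a placeholder
    stall criterion (an assumption, not the RRouter rule).
    """
    history: List[Dict[str, str]] = []
    stalled = 0
    for _ in range(max_iters):
        reply = model(history)

        # Termination: return a variable from the environment or a literal.
        match = FINAL_VAR_RE.search(reply)
        if match:
            return str(env.get(match.group(1)))
        match = FINAL_RE.search(reply)
        if match:
            return match.group(1).strip()

        # Otherwise treat the reply as code and execute it, capturing stdout.
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(reply, env)  # state persists in `env` across iterations
            output = buffer.getvalue()
        except Exception as exc:  # errors are fed back as evidence
            output = f"Error: {exc!r}"
        history.append({"code": reply, "output": output})

        # Placeholder progress monitor: count steps that produced no output.
        stalled = stalled + 1 if not output.strip() else 0
        if stalled >= patience:
            break
    return None


def scripted_model(history: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for the VLM: replays three fixed steps."""
    steps = [
        "context_images = {'flair': [[0, 1], [1, 1]]}\nprint(len(context_images))",
        "area = sum(map(sum, context_images['flair']))\nprint(area)",
        "FINAL_VAR(area)",
    ]
    return steps[len(history)]


result = run_loop(scripted_model, {}, max_iters=5)
print(result)  # prints 3
```

Because every intermediate step is stored in `history` as a code/output pair, the final answer can be traced back to the exact statements that produced it, which is the auditability property the abstract emphasises. A real system would also sandbox the `exec` call rather than run model-generated code directly.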