
UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou

TL;DR

The paper tackles the reliability gap in medical visual question answering caused by reasoning detachment and textual noise in multi-agent debates. It introduces UCAgents, a three-tier hierarchical framework that enforces unidirectional convergence to anchor reasoning to visual evidence: Tier-1 produces independent hypotheses, Tier-2 purifies consensus via visual-textual alignment, and Tier-3 conducts targeted adversarial risk auditing for final arbitration. Across four medical VQA benchmarks and multiple backbones, including GPT-4 and open-source models, UCAgents achieves higher accuracy and dramatically reduces token usage, while improving interpretability and robustness. The approach demonstrates strong potential for clinically relevant deployment, though its gains are bounded by the perceptual capabilities of the underlying visual encoder and the quality of input images.

Abstract

Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show that UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both the diagnostic reliability and the computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.
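
The three-tier flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `uc_pipeline`, the callable signatures, and the routing rule (unanimous Tier-1 answers go to Tier-2 review, anything else goes to a single Tier-3 audit over two candidate hypotheses) are all hypothetical and do not come from the paper or its repository. The one property it tries to mirror is unidirectionality: no agent is asked to revise an earlier answer, and control only moves forward through the tiers.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical agent interfaces (assumptions for illustration only).
Diagnose = Callable[[str, str], str]          # (image_id, question) -> answer
Review = Callable[[str, str, str], str]       # (image_id, question, consensus) -> supported answer
Audit = Callable[[str, str, str, str], str]   # (image_id, question, h_a, h_b) -> final answer


def uc_pipeline(
    image_id: str,
    question: str,
    tier1_agents: List[Diagnose],
    tier2_review: Review,
    tier3_audit: Audit,
) -> Tuple[str, int]:
    """Return (final_answer, tier_that_decided) for one question."""
    # Tier 1: independent hypotheses; agents never see each other's output.
    answers = [agent(image_id, question) for agent in tier1_agents]
    ranked = Counter(answers).most_common()

    if len(ranked) == 1:
        # Tier 2: a reviewer checks the consensus against the image.
        consensus = ranked[0][0]
        reviewed = tier2_review(image_id, question, consensus)
        if reviewed == consensus:
            return consensus, 2
        # The reviewer disputes the consensus: escalate both candidates.
        h_a, h_b = consensus, reviewed
    else:
        # Tier-1 divergence: escalate the two most common hypotheses.
        h_a, h_b = ranked[0][0], ranked[1][0]

    # Tier 3: a single audit round arbitrates between h_a and h_b.
    return tier3_audit(image_id, question, h_a, h_b), 3
```

In this sketch each agent is called at most once per question, which is one simple way a forward-only design keeps the number of model calls, and hence token use, bounded compared with open-ended debate.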

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Unlike the redundant discussion in previous multi-agent systems, UCAgents uses a one-round unidirectional inquiry to cut down textual noise and help focus on visual evidence.
  • Figure 1: Success Case Pattern 1: Tier-2 Output for Consensus Purification via Evidence Verification.
  • Figure 2: Overview of UCAgents. The UCAgents system is composed of three dynamic tiers: Initial Independent Diagnosis, Guidance Expert Review, and Critical Analysis and Questioning. $H_{a}$, $H_{b}$: divergent candidate hypotheses from previous tiers.
  • Figure 2: Success Case Pattern 2: Tier-3 Output for Adversarial Visual Grounding (Tier-1 Divergence). The supratentorial vs. infratentorial localization case demonstrates how unidirectional risk auditing constrains debate to observable anatomical features, preventing rhetorical drift.
  • Figure 3: Visual-evidence anchored diagnosis quality analysis. (a) Visual Evidence Coverage. UCAgents recalls more verified visual evidence from the image than MDAgents. (b) Decision Trajectory Entropy. The Unidirectional Convergence mechanism reduces the agents' confusion caused by a noisy decision space compared to MDAgents. (c) Textual Noise Ratio. UCAgents achieves a balance between evidence sentences and distracting sentences. The outer assistant processes 3 records together at one time for fairness.
  • ...and 4 more figures