Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

TL;DR

This pilot study offers preliminary insights into zero-shot agent performance in visually confounded scenarios. A multi-agent framework based on contrastive adjudication improves diagnostic performance and reduces unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment.

Abstract

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

Paper Structure (10 sections, 3 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Illustration of hard-to-separate disease pairs. Despite highly overlapping visual patterns, their typical etiologies and management differ significantly, which makes imaging-only differentiation both challenging and high-stakes.
  • Figure 2: Overview of Contrastive Agent Reasoning (CARE). Two disease-specific agents generate opposing evidence from the same input image (e.g., melanoma vs. atypical nevus). A judge agent adjudicates the arguments, flags unsupported evidence, and outputs the final diagnosis in a training-free, zero-shot setting.
  • Figure 3: Representative cases showing how CARE exposes contradictory findings, implements cross-agent evidence recalibration, and verifies claims against the image.
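
The control flow described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `Argument`, `Verdict`, and `adjudicate` are hypothetical, the two agents and the claim verifier are plain stub functions standing in for MLLM calls, the "most supported claims wins" rule is an assumed stand-in for the judge's reasoning, and the toy claims are made up rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Argument:
    """Evidence one disease-specific agent offers for its assigned diagnosis."""
    diagnosis: str
    claims: List[str]


@dataclass
class Verdict:
    """Judge output: final diagnosis plus claims split by image support."""
    diagnosis: str
    supported: Dict[str, List[str]]
    unsupported: Dict[str, List[str]]


# Hypothetical interfaces: an agent maps an image reference to an Argument,
# and a verifier decides whether a single claim is supported by the image.
Agent = Callable[[str], Argument]
Verifier = Callable[[str, str], bool]


def adjudicate(image: str, agent_a: Agent, agent_b: Agent,
               verify: Verifier) -> Verdict:
    """Collect opposing arguments, flag unsupported claims, pick a diagnosis."""
    arguments = [agent_a(image), agent_b(image)]
    supported: Dict[str, List[str]] = {}
    unsupported: Dict[str, List[str]] = {}
    for arg in arguments:
        supported[arg.diagnosis] = [c for c in arg.claims if verify(image, c)]
        unsupported[arg.diagnosis] = [c for c in arg.claims
                                      if not verify(image, c)]
    # Assumed decision rule: the diagnosis with more supported claims wins.
    # On a tie, max() returns the first argument, an arbitrary choice here.
    winner = max(arguments, key=lambda a: len(supported[a.diagnosis]))
    return Verdict(winner.diagnosis, supported, unsupported)


# Toy stubs with made-up claims; real agents would query an MLLM.
def melanoma_agent(image: str) -> Argument:
    return Argument("melanoma",
                    ["asymmetric shape", "irregular border", "blue-white veil"])


def nevus_agent(image: str) -> Argument:
    return Argument("atypical nevus",
                    ["regular pigment network", "uniform color"])


# Stand-in for checking a claim against the image: a fixed set of features.
VISIBLE = {"asymmetric shape", "irregular border", "regular pigment network"}


def toy_verify(image: str, claim: str) -> bool:
    return claim in VISIBLE


verdict = adjudicate("lesion_001.png", melanoma_agent, nevus_agent, toy_verify)
```

In this toy run the judge keeps two melanoma claims and one nevus claim, flags "blue-white veil" and "uniform color" as unsupported, and returns "melanoma". Keeping the verifier separate from the agents mirrors the caption's point that claims are checked against the image rather than accepted from whichever agent argues more forcefully.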