Comprehensive Evaluation of Multimodal AI Models in Medical Imaging Diagnosis: From Data Augmentation to Preference-Based Comparison
Cailian Ruan, Chengyue Huang, Yahe Yang
TL;DR
The paper develops an evaluation framework for multimodal AI in complex abdominal CT diagnosis. It expands a 500-case dataset to 3,000 cases through controlled augmentation and uses a three-way preference test, judged by Claude 3.5 Sonnet, to compare AI-generated diagnoses with physician diagnoses. Six multimodal systems are compared: four general-purpose large models and two vision-focused architectures. The general-purpose systems were often preferred over physician diagnoses in aggregate, with Llama 3.2-90B rated AI Superior in 85.27% of cases. The study highlights the strengths of integrated image–text reasoning over vision-only approaches and proposes a scalable, standardized methodology for clinical evaluation and decision support. While the results suggest promise for AI-assisted diagnostics, the paper also calls for broader validation across modalities and settings and for hybrid workflows that combine AI insights with physician expertise.
Abstract
This study introduces an evaluation framework for multimodal models in medical imaging diagnostics. We developed a pipeline incorporating data preprocessing, model inference, and preference-based evaluation, expanding an initial set of 500 clinical cases to 3,000 through controlled augmentation. Each model combined medical images with clinical observations to generate a diagnostic assessment, and Claude 3.5 Sonnet independently compared each assessment against the physician-authored diagnosis. Performance varied across models: Llama 3.2-90B was judged to outperform human diagnoses in 85.27% of cases, whereas the specialized vision models BLIP2 and Llava were preferred in 41.36% and 46.77% of cases, respectively. This framework highlights the potential of large multimodal models to outperform human diagnostics in certain tasks.
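As a rough illustration of how per-case judgments from a three-way preference test could be aggregated into percentages like those reported above, the following minimal sketch tallies labels over a list of cases. Only the label "AI Superior" comes from the text; the other two label names, the function name preference_rates, and the toy data are assumptions made for illustration and are not the authors' implementation or results.

```python
from collections import Counter

# Labels for the three-way preference test. Only "AI Superior" appears in the
# source text; the other two names are assumed here for illustration.
LABELS = ("AI Superior", "Physician Superior", "Equivalent")


def preference_rates(judgments):
    """Return the percentage of cases assigned to each preference label."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    # Counter returns 0 for labels that never occur, so every label is reported.
    return {label: round(100 * counts[label] / total, 2) for label in LABELS}


# Toy judgments for ten cases; these are NOT results from the study.
toy_judgments = (
    ["AI Superior"] * 6 + ["Physician Superior"] * 3 + ["Equivalent"] * 1
)
rates = preference_rates(toy_judgments)
print(rates)
```

Reporting all three rates together, rather than the "AI Superior" share alone, makes it easier to see how often the evaluator found the two diagnoses comparable.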
