How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark
Roba Al Majzoub, Hashmat Malik, Muzammal Naseer, Zaigham Zaheer, Tariq Mahmood, Salman Khan, Fahad Khan
TL;DR
The paper introduces Histo-VL, a fully open-source, large-scale histopathology vision-language benchmark that unifies evaluation across 31 cancer types and 26 organs, using data from 14 cohorts, 11 acquisition tools, and over 5 million patches from more than 41K WSIs. It pairs images with diverse captions (single and ensemble) and evaluates a range of VLMs on 7 clinically relevant tasks, using standard vision-language metrics plus calibration analyses. The findings reveal strong sensitivity to caption wording, poor model calibration (high ECE), and limited adversarial robustness, with magnification and stain normalization significantly affecting performance. The study also shows that some domain-specific models can approach or exceed pathologist performance on multiclass tasks, which points to potential clinical impact while underscoring reliability and deployment challenges. Overall, Histo-VL provides a comprehensive platform for diagnosing current limitations and guiding future improvements in histopathology VLMs for real-world use.
Abstract
Recently, histopathology vision-language foundation models (VLMs) have gained popularity due to their enhanced performance and generalizability across different downstream tasks. However, most existing histopathology benchmarks are either unimodal or limited in the diversity of clinical tasks, organs, and acquisition instruments they cover, and many are only partially available to the public because of patient data privacy. As a consequence, existing histopathology VLMs lack a comprehensive evaluation in a unified benchmark setting that reflects a wide range of clinical scenarios. To address this gap, we introduce Histo-VL, a fully open-source, comprehensive benchmark comprising images acquired using up to 11 different acquisition tools, paired with specifically crafted captions that incorporate class names and diverse pathology descriptions. Histo-VL includes 26 organs, 31 cancer types, and a wide variety of tissue obtained from 14 heterogeneous patient cohorts, totaling more than 5 million patches obtained from over 41K WSIs viewed under various magnification levels. We systematically evaluate existing histopathology VLMs on Histo-VL to simulate diverse tasks performed by experts in real-world clinical scenarios. Our analysis reveals several findings: most existing histopathology VLMs are highly sensitive to textual changes, with a drop in balanced accuracy of up to 25% in tasks such as metastasis detection; they show low robustness to adversarial attacks; and they are poorly calibrated, as evident from high expected calibration error (ECE) values and low prediction confidence. All of these can affect their clinical implementation.
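The two quantities the abstract reports, balanced accuracy and ECE, can be illustrated with a minimal sketch. This is not the benchmark's evaluation code. The function names, the equal-width binning, and the default of 10 bins are assumptions made here for illustration, and the paper may bin or average differently.

```python
import numpy as np

def balanced_accuracy(predictions, labels):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    recalls = [(predictions[labels == c] == c).mean() for c in np.unique(labels)]
    return float(np.mean(recalls))

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with equal-width confidence bins (an assumed, common formulation).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of samples that fall into it. Bins are (lo, hi], so a
    confidence of exactly 0 is ignored; a top-class softmax score is
    always greater than 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

if __name__ == "__main__":
    # Toy example: a model that always predicts class 0 with 0.9 confidence.
    labels = [0, 0, 0, 1]
    predictions = [0, 0, 0, 0]
    confidences = [0.9, 0.9, 0.9, 0.9]
    print("balanced accuracy:", balanced_accuracy(predictions, labels))
    print("ECE:", expected_calibration_error(confidences, predictions, labels))
```

In the toy example, plain accuracy is 0.75, but balanced accuracy is lower because the minority class is never detected. ECE is nonzero because the stated confidence exceeds the observed accuracy. Both effects matter for imbalanced clinical tasks such as metastasis detection.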
