A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed
TL;DR
This paper challenges the prevailing view that vision-language pre-training yields clear advantages in radiology by showing that unimodal pre-training with fine-grained labels is highly competitive, especially when data are scarce. It introduces a Disentangled Language-Image-Label Pre-training (DLILP) framework that separately optimizes image-label and image-text signals via distinct projections, blending a label-focused cross-entropy objective with a CLIP-style contrastive loss. Across seven downstream chest X-ray tasks, unimodal methods outperform many vision-language baselines, and DLILP delivers robust zero-shot performance and improved few-shot generalization while effectively integrating heterogeneous data. The findings suggest revisiting strong unimodal baselines, adopting more nuanced evaluation of base vs. novel findings, and leveraging DLILP to better balance label supervision and text supervision in medical vision models.
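To make the disentangled objective concrete, the following is a minimal NumPy sketch of how a label cross-entropy term and a CLIP-style contrastive term could be computed from the same image features through two separate projections. It is an illustration under stated assumptions, not the authors' implementation: all function and parameter names (`w_label`, `w_text`, `label_weight`, `dlilp_style_loss`) are hypothetical, the label term uses single-label softmax cross-entropy for simplicity, the two terms are summed with an assumed weight of 1.0, and the temperature of 0.07 is simply a commonly used CLIP value.

```python
import numpy as np


def log_softmax(z, axis=-1):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def label_cross_entropy(feats, w_label, labels):
    # Image-label branch: its own projection maps features to class logits.
    # Assumption: one integer class per image (a simplification).
    logits = feats @ w_label                      # (B, C)
    logp = log_softmax(logits, axis=1)
    return -logp[np.arange(len(labels)), labels].mean()


def clip_contrastive(feats, w_text, text_emb, temperature=0.07):
    # Image-text branch: a separate projection maps features into the
    # text embedding space; matched pairs lie on the diagonal.
    img = l2_normalize(feats @ w_text)            # (B, E)
    txt = l2_normalize(text_emb)                  # (B, E)
    logits = img @ txt.T / temperature            # (B, B)
    idx = np.arange(len(img))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def dlilp_style_loss(feats, w_label, w_text, labels, text_emb,
                     label_weight=1.0, temperature=0.07):
    # Hypothetical combination: weighted sum of the two branch losses.
    ce = label_cross_entropy(feats, w_label, labels)
    con = clip_contrastive(feats, w_text, text_emb, temperature)
    return label_weight * ce + con


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, D, C, E = 8, 16, 5, 12                     # batch, feature, class, text dims
    feats = rng.normal(size=(B, D))               # stand-in for encoder outputs
    loss = dlilp_style_loss(
        feats,
        w_label=rng.normal(size=(D, C)),
        w_text=rng.normal(size=(D, E)),
        labels=rng.integers(0, C, size=B),
        text_emb=rng.normal(size=(B, E)),
    )
    print(f"combined loss: {loss:.4f}")
```

Because the two projections do not share parameters, the label term and the text term only interact through the image features that feed both branches, which is the sense in which the sketch treats the two supervision signals as disentangled.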
Abstract
Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.
