A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?

Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed

TL;DR

This paper challenges the prevailing view that vision-language pre-training yields clear advantages in radiology by showing that unimodal pre-training with fine-grained labels is highly competitive, especially when data are scarce. It introduces a Disentangled Language-Image-Label Pre-training (DLILP) framework that optimizes image-label and image-text signals separately through distinct projections, combining a label-focused cross-entropy objective with a CLIP-style contrastive loss. Across seven downstream chest X-ray tasks, unimodal methods outperform many vision-language baselines, and DLILP delivers robust zero-shot and improved few-shot generalization while effectively integrating heterogeneous data. The findings suggest revisiting strong unimodal baselines, evaluating base and novel findings separately, and using DLILP to better balance label supervision and text supervision in medical vision models.
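
The disentangled objective summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the single-label softmax cross-entropy, the temperature of 0.07, and the weighting factor `lam` are all assumptions made for illustration. The only idea taken from the summary is that shared image features pass through two separate projections, one trained with a label cross-entropy and one trained with a CLIP-style contrastive loss against text embeddings.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(x, W):
    """Apply a projection matrix W (a list of rows) to a feature vector x."""
    return [dot(row, x) for row in W]

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)

def log_softmax(z):
    m = max(z)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(a - m) for a in z))
    return [a - lse for a in z]

def label_ce(features, labels, W_label):
    """Image-label branch: softmax cross-entropy on the label projection.
    Single-label classification is an assumption of this sketch."""
    losses = [-log_softmax(linear(x, W_label))[y]
              for x, y in zip(features, labels)]
    return sum(losses) / len(losses)

def clip_contrastive(features, text_embs, W_text, temperature=0.07):
    """Image-text branch: symmetric InfoNCE on the text projection.
    Row k of `features` is assumed to be paired with row k of `text_embs`."""
    img = [l2_normalize(linear(x, W_text)) for x in features]
    txt = [l2_normalize(t) for t in text_embs]
    n = len(img)
    sim = [[dot(i, t) / temperature for t in txt] for i in img]
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax([sim[r][k] for r in range(n)])[k]
               for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def dlilp_style_loss(features, labels, text_embs, W_label, W_text, lam=1.0):
    """Sum of the two branch losses; `lam` is a hypothetical weighting."""
    return (label_ce(features, labels, W_label)
            + lam * clip_contrastive(features, text_embs, W_text))

# Toy usage: two images, two classes, two paired report embeddings.
features = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for vision encoder outputs
labels = [0, 1]
text_embs = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for text encoder outputs
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = dlilp_style_loss(features, labels, text_embs, identity, identity)
```

Because each loss term only sees its own projection, gradients from the label objective would not reshape the image-text embedding space directly, which is the separation the summary attributes to DLILP. In a real setting the projections and encoders would be trained with an automatic-differentiation framework rather than evaluated on fixed matrices as here.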

Abstract

Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.

Paper Structure

This paper contains 21 sections, 7 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Training transferable vision models. Radiology reports include text descriptions, from which labels are extracted through entity extractor methods. Previous methods struggle to align language-image-label information without compromising zero-shot generalization (see the section on pitfalls). We propose DLILP, a Disentangled Language-Image-Label Pre-training that exploits text and label supervision in separate feature projections, described in the DLILP section.
  • Figure 2: Transferability. (a) Effect of increasing pre-training data (K=16); (b) Few-shot adaptation. Results are averaged over 7 tasks. M: MIMIC; C: CheXpert; P: PadChest.
  • Figure 3: Pitfalls of UniCL on novel categories. t-SNE of the embeddings produced after UniCL pre-training on the NIH-LT testing dataset. Large dots represent text prototypes, and small dots represent samples. Each color represents a category. The t-SNE representation shows that UniCL properly aligns labeled categories (top right) but collapses on novel categories (bottom right).