Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon

TL;DR

This work pioneers pre-training directly on uncurated studies, which aligns more closely with the radiologist's workflow and provides a natural path to scalability. It also introduces a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study.

Abstract

The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist's workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.
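The slice, scan, and study hierarchy described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a study is already tokenized into an array of shape `(scans, slice_groups, patches, dim)`, and the names `self_attention` and `hierarchical_attention` are hypothetical. The attention has a single head and no learned projections, so the only thing that changes between levels is which tokens are allowed to attend to each other.

```python
import numpy as np


def self_attention(x):
    """Single-head self-attention without learned projections.

    x has shape (groups, tokens, dim); attention is computed
    independently inside each group.
    """
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def hierarchical_attention(tokens, level):
    """Apply self-attention within one level of the hierarchy.

    tokens: (scans, slice_groups, patches, dim) for a single study
    (an assumed layout, chosen for illustration).
    level:  "slice", "scan", or "study".
    """
    m, d, n, c = tokens.shape
    if level == "slice":
        grouped = tokens.reshape(m * d, n, c)      # each slice group on its own
    elif level == "scan":
        grouped = tokens.reshape(m, d * n, c)      # all tokens of one scan together
    elif level == "study":
        grouped = tokens.reshape(1, m * d * n, c)  # every token in the study
    else:
        raise ValueError(f"unknown level: {level}")
    return self_attention(grouped).reshape(m, d, n, c)
```

In this sketch, the attention matrix at the slice level has `patches x patches` entries per group, while the study level needs one matrix over all `scans * slice_groups * patches` tokens. That cost gap is one plausible reading of why a few study-attention layers on top of lightweight slice or scan attention can be attractive.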

Paper Structure

This paper contains 28 sections, 3 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Illustration of (a) an uncurated study for a patient. While previous work has relied on annotation and curation, HLIP enables language-image pre-training directly on uncurated data. (b) Despite training on large-scale domain-specific datasets, naively modeling the uncurated study with a vanilla ViT, e.g., by randomly selecting a scan at each training step$^{\ast}$, encoding scans independently before study aggregation$^{\dagger}$, or directly encoding the entire study$^{\ddagger}$, yields performance only comparable to the SOTA trained on the PubMed corpus, whereas HLIP outperforms these by a large margin.
  • Figure 2: Illustration of (a) the radiology data hierarchy for a single patient, including the study, single scan, and adjacent slices. Our hierarchical attention mechanism mirrors this hierarchy and computes self-attention independently within each level. (b) Our HLIP framework incorporates a visual encoder that performs attention at different levels. In practice, lightweight slice or scan attention with a few study attention layers suffices to extract features from the full study.
  • Figure 3: Results of linear-probe and zero-shot evaluations on the CQ500 [chilamkurthy2018deep] and RSNA [flanders2020construction] datasets. We report AUC for each class. In linear-probe, red represents Google CT [yang2024advancing]; green represents FM-HeadCT [zhu20253d]; and blue represents our HLIP. In zero-shot evaluation, red represents the vanilla ViT and blue represents our HLIP.
  • Figure 4: Ablation study on the Pub-Brain-5 (a)-(d) and Rad-ChestCT [draelos2021machine] (e)-(h) datasets. We report balanced accuracy (ACC) on both datasets.
  • Figure 5: Qualitative results of zero-shot diagnosis on the Rad-ChestCT [draelos2021machine] and BraTS [baid2021rsna, labella2023asnrmiccai] datasets. The first row shows the original clinical scans with pathologic regions outlined (dashed red). The second row shows activation maps [chefer2021generic]. HLIP identifies the pathologic regions across multiple groups of adjacent chest CT slices (left) and brain MRI scans (right).
  • ...and 7 more figures
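
Several captions above report balanced accuracy (ACC). As a reminder of what that metric measures, here is a small self-contained sketch using the standard definition: the unweighted mean of per-class recall. The function name `balanced_accuracy` and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example with an imbalanced label set (made-up values).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
score = balanced_accuracy(y_true, y_pred)  # (4/4 + 1/2) / 2 = 0.75
```

Plain accuracy on the same toy labels would be 5/6, which is why balanced accuracy is often preferred when some diagnoses are much rarer than others.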