GreenRFM: Toward a resource-efficient radiology foundation model

Yingtai Li, Shuai Ming, Mingyue Zhao, Haoran Lai, Rongsheng Wang, Rui Zhou, Rundong Wang, Yujia Li, Wei Wei, Shaohua Kevin Zhou

TL;DR

GreenRFM is a resource-efficient pre-training framework for radiology foundation models that achieves state-of-the-art performance while cutting computational requirements by orders of magnitude. Results on internal musculoskeletal MRI images indicate that the same supervision principles transfer across modalities.

Abstract

The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods designed for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from a principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply increasing the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state of the art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB of VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions, spanning two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, on both public and private benchmarks, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the development of state-of-the-art RFMs for clinicians, even on a laptop.

Paper Structure (41 sections, 7 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: A paradigm shift from brute-force scaling to principled supervision design. Panel A illustrates the inefficiencies of current brute-force scaling approaches, which rely on massive noisy data and prohibitive compute to train parameter-heavy models, often resulting in brittle and expensive outcomes. Panel B depicts our proposed principled supervision design. By leveraging LLM distillation to create structured "Silver-Standard" labels and employing a streamlined architecture (ResNet-18), we achieve a robust and transferable model with 1/100 of the compute resources. Panel C quantitatively demonstrates this advantage: our method (green) achieves superior diagnostic performance while reducing computational cost and environmental impact by orders of magnitude compared to massive (red) and standard (orange) baselines.
  • Figure 2: Overview of the principled supervision design framework. The framework is built upon four key principles to maximize data efficiency. a, More distilled supervision: An LLM distills noisy radiology reports into structured diagnostic labels (present, absent, uncertain), creating a scalable source of supervision. b, Ubiquitous & semantics-enforcing supervision: A two-stage training strategy is employed. First, the vision and text encoders are independently pre-trained using the structured labels. Then, they are transferred to the alignment stage. Supervision is applied ubiquitously across distinct stages: to the vision encoder, the text encoder, and the shared alignment space. c, Task-aligning supervision: This panel illustrates the design constraints applied to the training pipeline to enforce strict consistency with the downstream diagnosis task: (1) Domain consistency via a radiology-specific text encoder; (2) Goal consistency by prioritizing diagnostic labels; (3) Architectural consistency by aligning pooling strategies and removing $L_2$ normalization; and (4) Semantic focus consistency by anchoring the alignment space with a shared classifier.
  • Figure 3: Training dynamics, objective consistency, and data scaling. a, Data scaling efficiency. Our method exhibits a significantly steeper scaling curve compared to previous works, matching state-of-the-art performance with less than 50% of the data. b, Generalization scaling law. The zero-shot performance on external validation sets (RAD-ChestCT and AH-Chest) improves consistently with training data scale. c, Alignment selectivity analysis. When the alignment loss is compared on label-related vs. label-unrelated report sentences, the supervised pre-trained model maintains a similar alignment loss on sentences that do not contain diagnostic labels (label-unrelated). d, Impact of supervision type. Pre-training with diagnostic labels (ours) achieves lower alignment loss and faster convergence compared to using visual descriptions. e, Two-stage vs. joint (multi-task) representation learning. The vision encoder achieves a significantly lower loss when trained independently (two-stage, stage 1) than when trained jointly with alignment objectives. f, Impact of two-stage training on alignment. The two-stage model converges faster and reaches a lower alignment loss than the direct-alignment baseline.
  • Figure 4: Ablation studies on task-aligning supervision design. a, Impact of supervision scope. Explicitly supervising every component (visual, text, and shared classifier) yields the best performance. b, Impact of text encoder. Domain-specific CXR-BERT outperforms general biomedical models. c, Impact of supervisory signal. Diagnostic labels provide stronger supervision than visual descriptions or no supervision. d, Impact of pooling strategy. Global Average Pooling (GAP) aligns better with downstream tasks than Noisy-OR or Max Pooling. e, Impact of $L_2$ normalization. Removing $L_2$ normalization preserves feature magnitude and improves performance.
  • Figure S1: Confusion matrices comparing LLM-extracted labels against Merlin official labels. The figure displays the confusion matrices for 30 abnormality categories. The rows represent the ground truth labels from the Merlin dataset, and the columns represent the labels extracted by Doubao. The labels are defined as: 1 (present), 0 (absent), and -1 (uncertain/missing).
  • ...and 3 more figures
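
The three-valued label encoding in the Figure S1 caption (1 present, 0 absent, -1 uncertain/missing) lends itself to a small illustration. The following is a minimal sketch, not the authors' evaluation code: it builds one 3x3 confusion matrix for a single abnormality category, with rows for the reference labels and columns for the LLM-extracted labels, as the caption describes. The function name `confusion_matrix`, the label lists, and the `agreement` summary are assumptions introduced here for illustration; the example labels are made up and are not data from the paper.

```python
from collections import Counter

# Label encoding taken from the Figure S1 caption.
LABELS = (1, 0, -1)  # present, absent, uncertain/missing


def confusion_matrix(reference, extracted, labels=LABELS):
    """Count label pairs; rows follow the reference labels, columns the extracted ones."""
    if len(reference) != len(extracted):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(reference, extracted))
    return [[counts[(r, c)] for c in labels] for r in labels]


# Hypothetical labels for one abnormality category (not data from the paper).
reference = [1, 1, 0, 0, 0, -1, 1, 0]
extracted = [1, 0, 0, 0, -1, -1, 1, 0]

matrix = confusion_matrix(reference, extracted)

# Fraction of cases where both sources assign the same label (diagonal mass).
agreement = sum(matrix[i][i] for i in range(len(LABELS))) / len(reference)

for label, row in zip(LABELS, matrix):
    print(f"reference={label:>2}: {row}")
print(f"agreement: {agreement:.2f}")
```

Repeating this per category would give one matrix for each of the 30 abnormality categories mentioned in the caption; how the paper aggregates or reports them beyond the figure is not stated here.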