Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models

Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron

Abstract

The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.

Paper Structure

This paper contains 15 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: (a) Scaling Laws. Training dynamics improve steadily with each modification (cf. the ablation section). ☆: fine-tuning at 512 resolution. (b) Few-shot Learning (A1). Curia-2 g converges faster and requires significantly fewer samples to match the performance of other models (a minimal probing sketch follows this list).
  • Figure 2: (Left) Content-Aware Cropping. Comparison of valid (green) and filtered-out (red) crop regions. (Right) Anatomically-Guided Masking. Each pair shows the average of 2000 mask samplings of our anatomically-guided masking against blockwise masking with a uniform prior (DINOv2, Oquab et al., 2023); see the masking sketch after this list.
  • Figure 3: Dense Features. Cosine similarity maps between a query patch (marked +) and all patches in multiple images. Curia-2 g shows strong semantic understanding of anatomical structures and cross-modality alignment, mapping structures between CT and MRI domains even under rotations; a similarity-map sketch follows this list.
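
The few-shot curves in Figure 1(b) are typically produced by fitting a linear probe on frozen backbone features with k labeled examples per class. The following is a minimal sketch of that protocol, not the actual Curia-2 evaluation code; the feature arrays are assumed to come from any frozen encoder.

```python
# Hedged sketch of a k-shot linear probe on frozen features,
# in the spirit of the Figure 1(b) evaluation. Feature matrices
# (Xtr, ytr, Xte, yte) are assumed precomputed by a frozen backbone.
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_probe(train_feats, train_labels, test_feats, test_labels, k, seed=0):
    """Fit a linear probe on k examples per class; return test accuracy."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(train_labels):
        pool = np.flatnonzero(train_labels == c)
        idx.extend(rng.choice(pool, size=min(k, len(pool)), replace=False))
    idx = np.asarray(idx)
    clf = LogisticRegression(max_iter=1000).fit(train_feats[idx], train_labels[idx])
    return clf.score(test_feats, test_labels)

# Usage: sweep the number of shots and average over resamplings.
# for k in (1, 2, 4, 8, 16):
#     accs = [few_shot_probe(Xtr, ytr, Xte, yte, k, seed=s) for s in range(5)]
#     print(k, np.mean(accs))
```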
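
For Figure 2 (right), the exact anatomically-guided sampler is not reproduced here; the sketch below only illustrates the contrast the figure visualizes, assuming the anatomical guidance amounts to biasing blockwise mask placement toward foreground patches. `fg_prior` is a hypothetical per-patch foreground probability (e.g., from a cheap intensity threshold on the CT/MRI slice).

```python
# Hedged sketch: blockwise masking with a uniform prior vs. a
# foreground-biased (anatomically-guided) variant. This is an
# illustrative stand-in, not the Curia-2 masking algorithm.
import numpy as np

def blockwise_mask(grid=16, n_blocks=8, block=3, fg_prior=None, rng=None):
    """Sample a boolean mask over a grid x grid token grid.

    fg_prior=None gives plain blockwise masking with a uniform prior
    over block centers; a prior biases centers toward anatomy.
    """
    rng = rng or np.random.default_rng()
    p = np.ones(grid * grid) if fg_prior is None else fg_prior.ravel().astype(float)
    p = p / p.sum()
    mask = np.zeros((grid, grid), dtype=bool)
    for center in rng.choice(grid * grid, size=n_blocks, p=p):
        r, c = divmod(int(center), grid)
        mask[max(r - block // 2, 0): r + block // 2 + 1,
             max(c - block // 2, 0): c + block // 2 + 1] = True
    return mask

# Averaging many samples, as in Figure 2, shows where masks concentrate:
# avg = np.mean([blockwise_mask(fg_prior=prior) for _ in range(2000)], axis=0)
```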
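
Finally, the dense-feature maps of Figure 3 reduce to a cosine similarity between one query patch token and all patch tokens of another image. Here is a minimal sketch; `encode_patches`, referenced in the usage comment, is a hypothetical helper returning patch embeddings of shape (N, D) from a ViT backbone, not the released Curia-2 API.

```python
# Hedged sketch of the dense-feature probe behind Figure 3:
# cosine similarity between a query patch and all target patches.
import torch
import torch.nn.functional as F

@torch.no_grad()
def similarity_map(query_tokens, target_tokens, query_idx, grid_hw):
    """Cosine similarity between one query patch and every target patch.

    query_tokens, target_tokens: (N, D) patch embeddings.
    Returns an (H, W) similarity map reshaped from the target grid.
    """
    q = F.normalize(query_tokens[query_idx], dim=-1)  # (D,)
    t = F.normalize(target_tokens, dim=-1)            # (N, D)
    return (t @ q).reshape(grid_hw)                   # (H, W)

# Usage sketch (encode_patches is a hypothetical backbone wrapper):
# tokens_ct = encode_patches(ct_slice); tokens_mr = encode_patches(mr_slice)
# sim = similarity_map(tokens_ct, tokens_mr, query_idx=500, grid_hw=(32, 32))
```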