Large-Scale 3D Medical Image Pre-training with Geometric Context Priors

Linshan Wu, Jiaxin Zhuang, Hao Chen

TL;DR

A simple-yet-effective Volume Contrast (VoCo) framework that leverages geometric context priors for self-supervision. It notably enhances performance on datasets with limited labeled cases and significantly expedites fine-tuning convergence.

Abstract

The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which offers a promising way to learn consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Code is available at https://github.com/Luffy03/Large-Scale-Medical.
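The position-prediction step described in the abstract can be illustrated with a minimal sketch. It assumes each crop has already been encoded into a feature vector, uses cosine similarity clamped to [0, 1] as the predicted similarity, and uses a mean absolute error as a stand-in loss. The function names, the toy feature vectors, and the choice of loss are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def predict_position(k_feat, q_feats):
    """Similarity of the random-crop feature to every base-crop feature.

    Negative similarities are clamped to 0 so that the prediction lives in
    the same [0, 1] range as overlap-based position labels (an assumption).
    """
    return [max(0.0, cosine_similarity(k_feat, q)) for q in q_feats]


def position_loss(similarities, labels):
    """Stand-in loss: mean absolute error between similarities and labels."""
    return sum(abs(s - y) for s, y in zip(similarities, labels)) / len(labels)


# Toy features (hypothetical): three base crops and one random crop.
q_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
k_feat = [1.0, 1.0]

s = predict_position(k_feat, q_feats)
loss = position_loss(s, [0.5, 0.5, 0.0])
```

In this toy setup the random crop is equally similar to the first two base crops and has zero (clamped) similarity to the third, so the loss only penalizes the gap between the predicted similarities and the target proportions.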

Paper Structure

This paper contains 25 sections, 9 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview: (a) We curate a large-scale 3D medical dataset PreCT-160K for pre-training. To the best of our knowledge, it is the largest existing pre-training dataset in this field, comprising 160K CT volumes (42M slices). (b) We investigate the scaling law in medical image pre-training, where VoCo stands out from previous methods in both data scale and model capacity. (c) We build a comprehensive benchmark for evaluation, which contains 48 downstream datasets across different tasks, i.e., segmentation, classification, registration, and vision-language (VL). Extensive experiments highlight the effectiveness of our proposed large-scale pre-training method.
  • Figure 2: Motivation of VoCo. In 3D medical images, the geometric relations between different organs are relatively consistent. We present some examples from PreCT-160K to illustrate these anatomical relationships across different regions. Motivated by this observation, we propose to leverage geometric context priors for learning consistent semantic representations and introduce a novel position prediction pretext task for pre-training.
  • Figure 3: Generating position labels for supervision. A random crop $k$ and a base crop $q$ form a positive pair if they overlap, and a negative pair otherwise. We calculate the overlap proportions as position labels $y$, e.g., $y_{1}, y_{2}, y_{5}, y_{6}$ are assigned $0.2, 0.3, 0.2, 0.3$, respectively.
  • Figure 4: Overall framework of VoCo. (a) First, we generate base crops $q$ with corresponding position labels $y$ (see Fig. 3). Then we input the random crop $k$ and base crops $q$ for contextual position prediction. Specifically, we employ a student-teacher module to project $k$ and $q$ separately, where the teacher projector is frozen and updated from the student projector with Exponential Moving Average (EMA). Finally, we conduct volume contrast between $k$ and $q$ to predict similarity $s$, where $s$ is supervised by position labels $y$. (b) We use the position labels to supervise the intra-volume contrast on $k$, $q_{pos}$, and $q_{neg}$, where $k$, $q_{pos}$, and $q_{neg}$ are from the same volume. (c) We extract random crop $k_{A}$ and base crops $q_{B}$ from different volumes $V_{A}$ and $V_{B}$ for inter-volume contrast.
  • Figure 5: Differences among fully-, self-, and omni-supervised learning. Solid and hollow markers denote labeled and unlabeled data, respectively. Dashed lines denote decision boundaries between different classes.
  • ...and 6 more figures
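
The position labels described in the Figure 3 caption can be sketched as overlap proportions between a random crop and a grid of base crops. The sketch below is a simplified 2D illustration under stated assumptions: base crops tile a 4 x 4 plane without overlapping, the random crop has the same size as one base crop, and each label is the overlap area divided by the random-crop area. The grid size, coordinates, and function names are hypothetical and are not taken from the paper's code.

```python
def overlap_1d(a_start, a_len, b_start, b_len):
    """Length of the overlap between two 1D intervals (0 if disjoint)."""
    return max(0.0, min(a_start + a_len, b_start + b_len) - max(a_start, b_start))


def position_labels(crop_x, crop_y, crop_size, grid=4, cell=1.0):
    """Overlap proportion between a random square crop and each base crop.

    Base crops are assumed to tile a (grid x grid) plane without overlap and
    are listed row by row. Each label is overlap area / random-crop area.
    """
    area = crop_size * crop_size
    labels = []
    for row in range(grid):
        for col in range(grid):
            w = overlap_1d(crop_x, crop_size, col * cell, cell)
            h = overlap_1d(crop_y, crop_size, row * cell, cell)
            labels.append(w * h / area)
    return labels


# A random crop that straddles four neighboring base crops.
labels = position_labels(crop_x=0.6, crop_y=0.5, crop_size=1.0)

# Base crops with a non-zero label are positives; the rest are negatives.
positives = [i + 1 for i, y in enumerate(labels) if y > 0]
```

With these hypothetical coordinates the positive base crops are numbers 1, 2, 5, and 6 (counting from 1), with labels of approximately 0.2, 0.3, 0.2, and 0.3, which mirrors the example values in the caption. All other base crops receive a label of 0.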