TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models

Tim Veenboer, George Yiasemis, Eric Marcus, Vivien Van Veldhuizen, Cees G. M. Snoek, Jonas Teuwen, Kevin B. W. Groot Lipman

TL;DR

This work introduces TAP-CT, a suite of 3D CT foundation models pretrained with a novel 3D adaptation of DINOv2 and ViT to learn task-agnostic, volumetric representations from 105K CT volumes. The approach employs GPU-accelerated 3D random resized crops and 3D masking, along with 3D patch embeddings and positional encodings, enabling scalable self-supervised pretraining on CT data. Across segmentation benchmarks, TAP-CT achieves state-of-the-art frozen-feature performance with a linear decoder, while classification results reveal challenges in obtaining robust global representations for scan-level tasks. The authors release pretrained weights, configurations, and benchmarking code to foster reproducibility and to establish a strong, low-resource baseline for medical-imaging foundation models, with future work focused on improving global 3D representations and reducing pretraining compute.

Abstract

Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This illustrates the need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce TAP-CT, a suite of CT foundation models obtained through task-agnostic pretraining: a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 for volumetric data that enables scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the architecture depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at https://huggingface.co/fomofo/tap-ct-b-3d.
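
To make the idea of a depth-aware patch embedding concrete, the following is a minimal NumPy sketch of how a CT volume could be split into non-overlapping 3D patches and linearly projected into tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names (patchify_3d, embed_patches), the patch size, the embedding dimension, and the random weights are all hypothetical.

```python
import numpy as np


def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping 3D patches.

    Returns the flattened patches with shape (num_tokens, pd * ph * pw)
    and the token grid shape (gd, gh, gw).
    """
    d, h, w = volume.shape
    pd, ph, pw = patch
    if d % pd or h % ph or w % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")
    grid = (d // pd, h // ph, w // pw)
    # (gd, pd, gh, ph, gw, pw) -> (gd, gh, gw, pd, ph, pw)
    blocks = volume.reshape(grid[0], pd, grid[1], ph, grid[2], pw)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    tokens = blocks.reshape(grid[0] * grid[1] * grid[2], pd * ph * pw)
    return tokens, grid


def embed_patches(tokens, weight, bias):
    """Project each flattened patch to an embedding vector.

    This matches a 3D convolution whose kernel and stride both equal
    the patch size, with the kernel weights flattened into `weight`.
    """
    return tokens @ weight + bias


rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 32, 32)).astype(np.float32)  # toy (D, H, W) volume
patch = (4, 8, 8)   # hypothetical patch size, not taken from the paper
embed_dim = 16      # hypothetical embedding width, not taken from the paper

tokens, grid = patchify_3d(volume, patch)
weight = (0.02 * rng.standard_normal((tokens.shape[1], embed_dim))).astype(np.float32)
bias = np.zeros(embed_dim, dtype=np.float32)
embeddings = embed_patches(tokens, weight, bias)

print(grid, tokens.shape, embeddings.shape)
```

In this sketch the token grid keeps a depth axis, so any positional encoding applied afterwards can index tokens by (depth, height, width) rather than by (height, width) alone.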

Paper Structure

This paper contains 28 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Pretraining and Evaluation of TAP-CT Foundation Models. Models are pretrained using a novel 3D adaptation of the DINOv2 framework and subsequently evaluated based solely on the representational quality of their learned features.
  • Figure 2: Cosine similarity matching between lesion embeddings in an initial and a follow-up scan of the same patient. The first set of slices shows the ground-truth lesion segmentations across ten slices ($z$); the second set shows the top-$k$ voxel matches between averaged lesion embeddings from the initial scan and all embeddings in the follow-up scan.
  • Figure 3: Distribution of pretraining data based on manufacturer, gender and age.
  • Figure 4: Segmentations across three abdominal slices from the AMOS22 validation sample (amos_286) for TAP-B-3D and other publicly available pretrained FMs. Each segmentation is produced using a linear convolutional layer fine-tuned on top of the frozen features of the respective pretrained encoder.
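
The matching described in the Figure 2 caption (average the lesion embeddings from the initial scan, then rank follow-up embeddings by cosine similarity and keep the top $k$) can be sketched as follows. This is a toy illustration using only the Python standard library; the function names and the three-dimensional vectors are made up for the example and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_embedding(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def top_k_matches(query, candidates, k):
    """Indices of the k candidates most similar to the query, best first."""
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by lower index
    return [i for _, i in scored[:k]]


# Hypothetical embeddings of voxels inside the lesion in the initial scan.
lesion_embeddings = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]

# Hypothetical embeddings of all voxels (or patches) in the follow-up scan.
followup_embeddings = [
    [0.0, 1.0, 0.0],
    [2.0, 0.1, 0.1],
    [0.0, 0.0, 1.0],
    [1.0, 0.2, 0.0],
]

query = mean_embedding(lesion_embeddings)
matches = top_k_matches(query, followup_embeddings, k=2)
print(matches)
```

Because cosine similarity ignores vector magnitude, the follow-up embedding at index 1 ranks first here even though its norm is roughly twice that of the query.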