Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

Rubén Moreno-Aguado, Alba Magallón, Victor Moreno, Yingying Fang, Guang Yang

Abstract

There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes than as vision encoders for vision-language models. Model weights and training code are publicly available.

Paper Structure

This paper contains 38 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the evaluation protocol. All evaluations use pre-computed embeddings from a pre-trained encoder backbone. Step (A) is common to all tasks: embeddings are extracted by the backbone and cached for downstream use. (B) Classification and regression: two methods are available depending on the token type used; a two-layer MLP is applied to the class token, while a Q-Former with cross-attention is applied to the patch tokens. (C) Instance retrieval: class tokens are ranked against a query token by cosine similarity. (D) Localisation: patch tokens are decoded to normalised 3D coordinates via multi-head self-attention, with a softmax operation used to weight the positional coordinates. (E) Segmentation: patch tokens are reshaped into a 3D feature map and decoded through 3D convolutional and upsampling layers. (F) Report generation: patch tokens are projected via cross-attention and combined with a system prompt before being passed to a large language model (LLM). A minimal code sketch of this probing pipeline is given after the figure list below.
  • Figure 2: Bar chart summary of model performance across six tasks. Results are shown for classification (AUROC), regression (MAE $\downarrow$), survival analysis (AUROC), localisation (MAE $\downarrow$), segmentation (DICE), and retrieval (Recall@10), with 95% confidence intervals displayed for each model. TotalSegmentator segmentation is reported as micro-averaged DICE. CT-RATE and Merlin classification results are macro-averaged over all abnormality classes. See Table \ref{tab:results_main} for full numerical results.
  • Figure 3: Per-abnormality breakdown of classification and report generation performance across 18 findings. (Left) Binary classification F1 scores (threshold $= 0.5$) for each of the 18 abnormalities in the CT-RATE dataset, where an individual Q-Former probe is trained per label as described in Figure \ref{fig:evalmethods}. (Right) Corresponding report generation F1 scores for the same 18 CT-RATE abnormality labels, aligned row-by-row with the left panel. A red cross marks the best-performing classifier result in each row.
  • Figure 4: Effect of downstream dataset size. Q-Former probes are trained on various fractions of labelled training data (20%--100%) and evaluated on a fixed held-out test set. (Left) iCTCF-Covid. (Right) RSNA-STR. Error bars represent 95% confidence intervals.
  • Figure 5: Comparison of inference strategies and feature aggregation methods. (Left) MLP applied to the class token versus a single-layer Q-Former applied to patch tokens for classification tasks. CT-RATE and Merlin results are macro-averaged over their respective abnormality labels. Error bars represent 95% confidence intervals. (Right) Chunked 2.5D versus full 3D inference across classification tasks.
  • ...and 1 more figure
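
The probing pipeline in Figure 1 is compact enough to sketch in a few dozen lines. The PyTorch snippet below illustrates panels (A)-(E): caching frozen-backbone tokens, a two-layer MLP probe on the class token, cosine-similarity retrieval, a localisation head that softmax-weights patch positions, and a convolutional segmentation decoder. This is a minimal sketch under assumed module names, tensor shapes, and dimensions, not the authors' released code; in particular, the localisation head uses a single learned attention score in place of the multi-head self-attention decoder described in the caption, and the Q-Former probe and report-generation projector are omitted.

```python
# Minimal sketch of the frozen-backbone probing pipeline in Figure 1.
# Module names, tensor shapes, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def cache_embeddings(backbone, volumes):
    """(A) Run the frozen backbone once and cache class/patch tokens."""
    backbone.eval()
    cls_tokens, patch_tokens = [], []
    for vol in volumes:                      # vol: (1, 1, D, H, W) CT volume
        tokens = backbone(vol)               # assumed to return (1, 1 + N, C)
        cls_tokens.append(tokens[:, 0])      # class token:  (1, C)
        patch_tokens.append(tokens[:, 1:])   # patch tokens: (1, N, C)
    return torch.cat(cls_tokens), torch.cat(patch_tokens)


class MLPProbe(nn.Module):
    """(B) Two-layer MLP on the cached class token (classification/regression)."""
    def __init__(self, dim, n_out, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, n_out))

    def forward(self, cls_token):            # (B, C) -> (B, n_out)
        return self.net(cls_token)


def retrieve_top_k(query_cls, gallery_cls, k=10):
    """(C) Instance retrieval: rank gallery class tokens by cosine similarity."""
    sims = F.cosine_similarity(query_cls.unsqueeze(1),
                               gallery_cls.unsqueeze(0), dim=-1)      # (Bq, Bg)
    return sims.topk(k, dim=-1).indices      # indices of the k nearest volumes


class LocalisationHead(nn.Module):
    """(D) Softmax-weight patch positions to predict normalised 3D coordinates
    (simplified: one learned score per patch, not multi-head self-attention)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens, patch_positions):
        # patch_tokens: (B, N, C); patch_positions: (N, 3), each in [0, 1]
        weights = self.score(patch_tokens).softmax(dim=1)             # (B, N, 1)
        return (weights * patch_positions.unsqueeze(0)).sum(dim=1)    # (B, 3)


class SegmentationDecoder(nn.Module):
    """(E) Reshape patch tokens into a 3D feature map and decode to voxel logits."""
    def __init__(self, dim, grid, n_classes):
        super().__init__()
        self.grid = grid                      # (d, h, w) patch-grid shape, d*h*w = N
        self.decode = nn.Sequential(
            nn.Conv3d(dim, 64, kernel_size=3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),
            nn.Conv3d(64, n_classes, kernel_size=1))

    def forward(self, patch_tokens):          # (B, N, C)
        d, h, w = self.grid
        fmap = patch_tokens.transpose(1, 2).reshape(
            patch_tokens.shape[0], -1, d, h, w)             # (B, C, d, h, w)
        return self.decode(fmap)              # (B, n_classes, 2d, 2h, 2w)
```

Because the backbone stays frozen, only the probe parameters are trained, which is what keeps transfer to a new task cheap in both compute and labelled data.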