Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Pengfei Gu, Huimin Li, Yejia Zhang, Chaoli Wang, Danny Z. Chen

TL;DR

This work tackles data scarcity in 3D medical image segmentation by extending masked autoencoder (MAE) pre-training with topology- and spatiality-aware components tailored to 3D data. It introduces a differentiable topological loss based on persistent homology and the 2-Wasserstein distance, plus a 9-point global spatial pre-text task, to enrich geometric and positional representations. The approach co-pretrains a ViT and a hybrid state-of-the-art (SOTA) architecture (UNETR++) using reconstruction, topological, and spatial losses together with spatial and reconstruction consistency losses, and then fine-tunes by fusing the pre-trained ViT encoder with a pre-trained UNETR++ decoder. Experiments on five public datasets report consistent Dice and HD95 improvements over strong baselines, including MAE-based pre-training and UNETR++, on the Synapse, BTCV, ACDC, and MSD datasets.
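To make the 2-Wasserstein term concrete, the following is a minimal sketch of the distance between two small persistence diagrams. It is not the paper's loss: it assumes the diagrams (lists of birth/death pairs) have already been computed, it is not differentiable, it assumes a Euclidean ground metric, and it brute-forces the optimal matching, which is only practical for a handful of points. The function names `wasserstein2` and `_diagonal_cost` are hypothetical.

```python
import itertools
import math


def _diagonal_cost(point):
    """Squared Euclidean distance from a (birth, death) point to the diagonal."""
    birth, death = point
    return (death - birth) ** 2 / 2.0


def wasserstein2(diag_a, diag_b):
    """2-Wasserstein distance between two persistence diagrams.

    Each diagram is a list of (birth, death) pairs. Every point may be
    matched either to a point of the other diagram or to the diagonal.
    The matching is found by brute force, so keep the diagrams tiny.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m  # each side is padded with diagonal slots for the other side
    if size == 0:
        return 0.0

    def cost(i, j):
        a_is_point, b_is_point = i < n, j < m
        if a_is_point and b_is_point:
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return (b1 - b2) ** 2 + (d1 - d2) ** 2
        if a_is_point:
            return _diagonal_cost(diag_a[i])
        if b_is_point:
            return _diagonal_cost(diag_b[j])
        return 0.0  # diagonal slot matched to diagonal slot

    best = min(
        sum(cost(i, j) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(size))
    )
    return math.sqrt(best)
```

For example, `wasserstein2([(0.0, 1.0)], [(0.0, 3.0)])` matches the two points directly, because that is cheaper than sending both to the diagonal. A practical implementation would use an assignment solver and a differentiable persistent homology layer instead of this enumeration.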

Abstract

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information from visible patches, a ViT encoder can aggregate contextual information for downstream tasks. However, existing MAE pre-training methods, which were developed specifically for the ViT architecture, lack the ability to capture geometric shape and spatial information, both of which are critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss that preserves geometric shape information by computing topological signatures of both the input and reconstructed volumes. (2) We introduce a pre-text task that predicts the positions of the center and eight corners of each 3D crop, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.
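The pre-text task in contribution (2) can be illustrated with a short sketch of how regression targets for the nine points might be built. The abstract does not specify the coordinate convention, so everything here is an assumption: crops are axis-aligned, positions are expressed as fractions of the full volume extent along each axis, and the function name `nine_point_targets` is hypothetical.

```python
import itertools


def nine_point_targets(crop_start, crop_size, volume_shape):
    """Return 9 normalized (z, y, x) positions for a 3D crop.

    The first entry is the crop center; the remaining eight are the
    corners, enumerated over all low/high combinations per axis.
    Coordinates are divided by the volume shape (an assumed
    normalization, not one stated in the abstract).
    """
    center = [start + size / 2.0 for start, size in zip(crop_start, crop_size)]
    corners = [
        [start + offset * size
         for start, size, offset in zip(crop_start, crop_size, offsets)]
        for offsets in itertools.product((0, 1), repeat=3)
    ]
    points = [center] + corners
    return [
        [coord / extent for coord, extent in zip(point, volume_shape)]
        for point in points
    ]


# Usage: a 64^3 crop starting at voxel (32, 32, 32) inside a 128^3 volume.
targets = nine_point_targets((32, 32, 32), (64, 64, 64), (128, 128, 128))
```

A network head would then regress these 27 values (9 points, 3 coordinates each) from the encoded crop, for instance with a mean squared error objective.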

Paper Structure (13 sections, 1 equation, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Illustration of the effect of our proposed topological loss. Left: raw image examples from the Synapse CT dataset; middle: images reconstructed with the mean squared error (MSE) loss [zhou2022self]; right: images reconstructed with a combination of the MSE and proposed topological losses.
  • Figure 2: An overview of our proposed pipeline.
  • Figure 3: Visual results of different methods on the Synapse CT dataset.