Masked LoGoNet: Fast and Accurate 3D Image Analysis for Medical Domain

Amin Karimi Monsefi, Payam Karisani, Mengxi Zhou, Stacey Choi, Nathan Doble, Heng Ji, Srinivasan Parthasarathy, Rajiv Ramnath

TL;DR

LoGoNet addresses the dual challenges of data scarcity and high inference demand in 3D medical image segmentation by integrating ULKANet with Large Kernel Attention into a dual global-local encoder, and augmenting it with a 3D self-supervised pretraining regime that jointly masks sequences of 3D patches and learns via multi-cluster pseudo-labels. The global and local streams are fused to preserve both long-range context and fine-grained details, enabling robust segmentation of irregular organs like the spleen while maintaining low inference-time costs. Extensive experiments on BTCV and MSD show LoGoNet achieving state-of-the-art Dice scores with favorable computational efficiency, and pretraining further enhances performance across tasks, highlighting practical impact for clinically scalable deployment. The proposed masking-based SSL, compatible with both CNN and ViT backbones, plus the dual-encoding scheme, provides a versatile framework for leveraging unlabeled 3D medical data to improve generalization under limited labeled data regimes.

Abstract

Modern machine-learning-based imaging methods face challenges in medical applications because datasets are costly to construct and labeled training data is therefore limited. Additionally, once deployed, these methods typically process a large volume of data every day, imposing a high maintenance cost on medical facilities. In this paper, we introduce a new neural network architecture, termed LoGoNet, with a tailored self-supervised learning (SSL) method to mitigate these challenges. LoGoNet integrates a novel feature extractor within a U-shaped architecture, leveraging Large Kernel Attention (LKA) and a dual encoding strategy to capture both long-range and short-range feature dependencies. This is in contrast to existing methods that rely on increasing network capacity to enhance feature extraction. This combination of techniques is especially beneficial in medical image segmentation, given the difficulty of learning intricate and often irregular organ shapes, such as the spleen. In addition, we propose a novel SSL method tailored for 3D images to compensate for the lack of large labeled datasets. The method combines masking and contrastive learning techniques within a multi-task learning framework and is compatible with both Vision Transformer (ViT) and CNN-based models. We demonstrate the efficacy of our methods on numerous tasks across two standard datasets (BTCV and MSD). Benchmark comparisons with eight state-of-the-art models highlight LoGoNet's superior performance in both inference time and accuracy.
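
As a rough illustration of the dual encoding strategy mentioned above, the sketch below partitions a volume into non-overlapping sub-blocks for a local stream while a global stream sees the whole volume. The block size, the helper names (`partition_volume`, `mean_intensity`, `dual_encode`), and the use of a mean intensity as a stand-in feature are all assumptions made for illustration. The actual feature extractor is ULKANet, and the real partitioning and fusion details are not given in this section.

```python
from itertools import product


def partition_volume(shape, block):
    """Return index ranges of non-overlapping sub-blocks covering a 3D shape."""
    axes = []
    for size, step in zip(shape, block):
        axes.append([(s, min(s + step, size)) for s in range(0, size, step)])
    # Cartesian product of per-axis ranges gives one region per sub-block.
    return list(product(*axes))


def mean_intensity(volume, region):
    """Hypothetical stand-in for a feature extractor: mean voxel value."""
    (d0, d1), (h0, h1), (w0, w1) = region
    vals = [volume[d][h][w]
            for d in range(d0, d1)
            for h in range(h0, h1)
            for w in range(w0, w1)]
    return sum(vals) / len(vals)


def dual_encode(volume, block):
    """Pair each local (sub-block) feature with the global (whole-volume) one."""
    shape = (len(volume), len(volume[0]), len(volume[0][0]))
    whole = tuple((0, n) for n in shape)
    global_feat = mean_intensity(volume, whole)
    local_feats = [mean_intensity(volume, r)
                   for r in partition_volume(shape, block)]
    # Assumed fusion: simple pairing (a toy analogue of concatenation).
    return [(lf, global_feat) for lf in local_feats]


# Toy 4x4x4 volume whose voxel values are 0..63.
volume = [[[d * 16 + h * 4 + w for w in range(4)]
           for h in range(4)]
          for d in range(4)]
fused = dual_encode(volume, block=(2, 2, 2))
```

Here each of the eight sub-blocks yields one local value that is paired with the single global value. A real model would replace both with learned feature maps.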

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 14 tables, 4 algorithms.

Figures (4)

  • Figure 1: (a) Overview of our model LoGoNet. To account for both local and global feature dependencies, images are fed into the model through two parallel mechanisms. In the local mechanism, the input data is partitioned into small parts, and each part is fed separately into our feature extractor (ULKANet). (b) Overview of the ULKANet architecture, a U-shaped network with an encoder-decoder design. Blue circles represent encoder blocks, and green circles represent decoder blocks. The $+$ sign represents element-wise summation, and the $\times$ sign represents the concatenation operator.
  • Figure 2: Illustration of our pre-training pipeline. We begin by randomly selecting a set of $m$ sequential images (here $m$ is four), to which we apply patching and masking. LoGoNet is then used to predict the set of pseudo-labels that we generated for each distorted image (see the pre-training section for details). During the pre-training stage, a classification head (a feed-forward network) is used on top of the model for prediction. This head is replaced with a convolution head (see Figure 1) for fine-tuning on the segmentation task with labeled data.
  • Figure 3: Output of LoGoNet compared with that of the best-performing baseline model on the BTCV dataset, DiNTS Search. Our model visibly outperforms this baseline in detecting organ boundaries.
  • Figure 4: Architecture of our feature extractor (ULKANet). A number next to a component indicates that the component is repeated in sequence that many times.
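
The input preparation described in the Figure 2 caption (select $m$ sequential images, then patch and mask them) can be sketched roughly as follows. The patch size, mask ratio, zero fill value, function names, and the choice to mask each image independently are assumptions made for illustration. Pseudo-label generation and the classification head are omitted because their details are not given in this section.

```python
import random


def select_sequence(num_slices, m, rng):
    """Pick m consecutive slice indices at a random starting position."""
    start = rng.randrange(num_slices - m + 1)
    return list(range(start, start + m))


def mask_patches(image, patch, ratio, rng, fill=0):
    """Return a copy of a 2D image with a random fraction of patches masked."""
    h, w = len(image), len(image[0])
    # Top-left corner of every patch in a regular grid.
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_masked = round(ratio * len(corners))
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in rng.sample(corners, n_masked):
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                masked[i][j] = fill
    return masked


def distort_sequence(volume, m, patch, ratio, seed=0):
    """Select m sequential slices and mask each one (assumed: independently)."""
    rng = random.Random(seed)
    idx = select_sequence(len(volume), m, rng)
    return idx, [mask_patches(volume[i], patch, ratio, rng) for i in idx]


# Toy volume: 10 slices of 8x8 voxels, all set to 1.
volume = [[[1] * 8 for _ in range(8)] for _ in range(10)]
indices, distorted = distort_sequence(volume, m=4, patch=2, ratio=0.5)
```

With these toy settings each 8x8 slice has 16 patches of size 2x2, so a ratio of 0.5 masks 8 patches (32 voxels) per slice. In the setup the caption describes, the distorted images would then be passed to the model to predict their pseudo-labels.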