CATS v2: Hybrid encoders for robust medical segmentation

Hao Li, Han Liu, Dewei Hu, Xing Yao, Jiacheng Wang, Ipek Oguz

TL;DR

The paper addresses the limited global context of CNNs in 3D medical image segmentation by introducing CATS v2, a dual-encoder architecture that combines a CNN-based U-Net path with a Swin Transformer path using shifted windows. Features from both encoders are fused at multiple resolutions and fed into a CNN decoder, enabling simultaneous exploitation of local details and global context. Evaluations on BTCV, CrossMoDA, and MSD-5 demonstrate state-of-the-art Dice scores across tasks, with particularly notable gains on smaller or challenging structures, though BTCV gains are not uniformly distributed. The approach offers a robust backbone for medical segmentation and has potential as a foundation for future SAM-based or lightweight models in clinical settings.
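The multi-resolution fusion described above can be sketched as a toy example. This summary does not state the fusion operator, so element-wise addition is an assumption here, as are the function name `fuse_skip_features` and the use of flat Python lists in place of 3D feature maps.

```python
def fuse_skip_features(cnn_feats, trans_feats):
    """Fuse per-resolution features from two encoders.

    Each argument is a list with one entry per resolution level; each entry
    is a flat list of floats standing in for a feature map.
    ASSUMPTION: fusion is element-wise addition (not confirmed by the source).
    """
    if len(cnn_feats) != len(trans_feats):
        raise ValueError("both encoders must provide the same number of levels")
    fused = []
    for level, (c, t) in enumerate(zip(cnn_feats, trans_feats)):
        if len(c) != len(t):
            raise ValueError(f"level {level}: feature sizes differ")
        # One fused feature map per skip connection.
        fused.append([a + b for a, b in zip(c, t)])
    return fused


# Two resolution levels: a finer one (4 values) and a coarser one (2 values).
cnn_feats = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5]]
trans_feats = [[0.0, 1.0, 0.0, 1.0], [1.5, -0.5]]
fused = fuse_skip_features(cnn_feats, trans_feats)
print(fused)
```

Each fused entry would then be passed to the decoder stage at the matching resolution. Concatenation followed by a convolution is another common choice for this step.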

Abstract

Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks by capturing high-level (local) information, such as edges and textures. However, due to the limited field of view of the convolution kernel, it is hard for CNNs to fully represent global information. Recently, transformers have shown good performance for medical image segmentation due to their ability to better model long-range dependencies. Nevertheless, transformers struggle to capture high-level spatial features as effectively as CNNs. A good segmentation model should learn a better representation from both local and global features to be both precise and semantically accurate. In our previous work, we proposed CATS, a U-shaped segmentation network augmented with a transformer encoder. In this work, we extend this model and propose CATS v2 with hybrid encoders. Specifically, the hybrid encoders consist of a CNN-based encoder path in parallel with a shifted-window transformer path, which together better leverage both local and global information to produce robust 3D medical image segmentation. We fuse the information from the convolutional encoder and the transformer at the skip connections of different resolutions to form the final segmentation. The proposed method is evaluated on three public challenge datasets: Beyond the Cranial Vault (BTCV), Cross-Modality Domain Adaptation (CrossMoDA), and task 5 of the Medical Segmentation Decathlon (MSD-5), to segment abdominal organs, vestibular schwannoma (VS), and prostate, respectively. Compared with state-of-the-art methods, our approach achieves higher Dice scores. Our code is publicly available at https://github.com/MedICL-VU/CATS.
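Since the evaluation is reported in Dice scores, a minimal sketch of the per-label Dice coefficient may help. It uses the standard definition 2|P ∩ T| / (|P| + |T|) on flattened label maps. The function name `dice_score`, the toy data, and the convention of returning 1.0 when a label is absent from both maps are assumptions for illustration, not the paper's evaluation code.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label over two flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # ASSUMPTION: a label missing from both maps counts as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Toy 6-voxel example with background (0) and two foreground labels (1, 2).
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
dice_label_1 = dice_score(pred, target, 1)  # 2*1 / (2+1)
dice_label_2 = dice_score(pred, target, 2)  # 2*2 / (2+3)
print(dice_label_1, dice_label_2)
```

For multi-structure datasets, the per-label scores are typically averaged over labels and cases. One mislabeled voxel costs a small structure (label 1 here) far more Dice than a larger one.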

Paper Structure

This paper contains 12 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: (a) Proposed network architecture. (b) 2D illustration of the shifted-window scheme, in which self-attention is computed only within each non-overlapping local window. Note that the patch sizes vary.
  • Figure 2: Qualitative results in BTCV. Some major differences are highlighted by orange arrows.
  • Figure 3: Qualitative results in CrossMoDA. Local segmentation errors are highlighted with arrows.
  • Figure 4: Qualitative results in MSD-5. Local segmentation errors are highlighted with arrows. Red and green labels denote the peripheral zone (PZ) and the transition zone (TZ), respectively.
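
The shifted-window idea illustrated in Figure 1(b) can be sketched in 2D with plain Python. The helper names `roll2d` and `window_partition`, the 4x4 grid, and the window size of 2 are assumptions for illustration. The cyclic shift by half a window follows the general Swin Transformer scheme rather than code from this paper, and the attention masking applied to wrapped-around windows is omitted.

```python
def roll2d(grid, shift):
    """Cyclically shift a 2D grid up and to the left by `shift` positions."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, m):
    """Split a 2D grid into non-overlapping m x m windows (row-major order).

    Each window is returned as a flat list of the tokens it contains;
    self-attention would be computed only among tokens in the same window.
    """
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for r0 in range(0, h, m):
        for c0 in range(0, w, m):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + m)
                            for c in range(c0, c0 + m)])
    return windows


# 4x4 grid of token ids 0..15, window size 2, shift of half a window (1).
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)
shifted = window_partition(roll2d(grid, 1), 2)
print(regular[0])  # tokens grouped in the first regular window
print(shifted[0])  # first shifted window straddles four regular windows
```

Alternating regular and shifted partitions between consecutive layers lets information cross window boundaries while each attention computation stays local. In the full scheme, windows that wrap around the border (the last shifted window here) are masked so that tokens from opposite edges do not attend to each other.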