HResFormer: Hybrid Residual Transformer for Volumetric Medical Image Segmentation

Sucheng Ren, Xiaomeng Li

TL;DR

HResFormer tackles 3D medical image segmentation by integrating a 2D transformer that captures fine-grained intra-slice details with a 3D transformer that models inter-slice context. The Hybrid Local-Global Fusion Module (HLGM) fuses 2D predictions and 3D data at local and global scales, while residual learning lets the 3D branch refine the 2D baseline, improving 3D anatomy understanding. Across multiple benchmarks, including Synapse, BRATS, and ACDC, it achieves state-of-the-art or competitive results with favorable compute, underscoring the value of a hybrid 2D-3D transformer design for volumetric medical segmentation. This approach aligns with radiologists’ axial-to-3D reasoning and offers a practical, scalable path for accurate clinical segmentation.

Abstract

Vision Transformers have shown strong performance in medical image segmentation owing to their ability to learn long-range dependencies. For medical image segmentation from 3D data, such as computed tomography (CT), existing methods can be broadly classified into 2D-based and 3D-based methods. One key limitation of 2D-based methods is that inter-slice information is ignored, while the limitation of 3D-based methods is their high computation cost and memory consumption, which results in a limited feature representation for inner-slice information. During clinical examination, radiologists primarily use the axial plane and then routinely review both axial and coronal planes to form a 3D understanding of anatomy. Motivated by this fact, our key insight is to design a hybrid model that first learns fine-grained inner-slice information and then generates a 3D understanding of anatomy by incorporating 3D information. We present a novel Hybrid Residual transFormer (HResFormer) for 3D medical image segmentation. Building upon standard 2D and 3D Transformer backbones, HResFormer involves two novel key designs: (1) a Hybrid Local-Global fusion Module (HLGM) that effectively and adaptively fuses inner-slice information from the 2D Transformer with inter-slice information from 3D volumes, giving the 3D Transformer both local fine-grained and global long-range representations; and (2) residual learning of the hybrid model, which effectively leverages the inner-slice and inter-slice information for a better 3D understanding of anatomy. Experiments show that HResFormer outperforms prior art on widely used medical image segmentation benchmarks. This paper sheds light on an important but neglected way to design Transformers for 3D medical image segmentation.
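
The compute argument in the abstract can be made concrete with a rough count. The sketch below assumes plain global self-attention, whose cost grows with the square of the token count, and non-overlapping in-plane patches with no patching along depth. These are simplifying assumptions for illustration, not the paper's actual backbones, and the volume size and patch size are hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one global self-attention layer."""
    return num_tokens * num_tokens


def compare_costs(depth: int, height: int, width: int, patch: int) -> dict:
    """Compare slice-wise (2D) and whole-volume (3D) attention pair counts."""
    # Tokens in one axial slice after patch x patch embedding (assumed).
    tokens_per_slice = (height // patch) * (width // patch)
    # 2D route: each slice attends only within itself, once per slice.
    cost_2d = depth * attention_pairs(tokens_per_slice)
    # 3D route: tokens from all slices attend to each other at once.
    cost_3d = attention_pairs(depth * tokens_per_slice)
    return {
        "tokens_per_slice": tokens_per_slice,
        "cost_2d": cost_2d,
        "cost_3d": cost_3d,
        "ratio": cost_3d // cost_2d,
    }


# Hypothetical 64 x 224 x 224 volume with 16 x 16 in-plane patches.
costs = compare_costs(depth=64, height=224, width=224, patch=16)
print(costs)
```

Under these assumptions the 3D route scores `depth` times as many token pairs as the 2D route, which is one way to see why 3D models often trade in-plane resolution for tractable memory.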

Paper Structure

This paper contains 15 sections, 10 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Overview of our HResFormer. It consists of three main parts: a 2D model, a Hybrid Local-Global fusion Module (HLGM), and a 3D model. The 3D volume is split into slices, and the 2D model makes predictions slice by slice to generate fine-grained inner-slice information. The 2D predictions are then stacked into a volume (we still call them 2D predictions) and fused with the 3D volume locally and globally via HLGM. Finally, the 3D model takes the fused features as input and incorporates the 2D predictions with residual learning for a better understanding of 3D anatomy with both inner-slice and inter-slice representations.
  • Figure 2: Overview of the Hybrid Local-Global Fusion Module. We use one module to complement 2D prediction features with 3D volume features and 3D volume features with 2D prediction features.
  • Figure 3: Examples of multi-organ segmentation results. "GT" refers to the ground truth. Red, magenta, pink, and purple refer to the aorta (Ao), liver (Li), stomach (St), and spleen (Sp), respectively. "2D" and "3D" refer to HResFormer (only 2D) and HResFormer (only 3D) in Table \ref{tab:hybrid}.
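
The data flow described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `fuse` is a hypothetical stand-in for HLGM that simply pairs each voxel's 2D prediction with its intensity, the residual step is assumed to be an element-wise addition, and `toy_model_2d` and `toy_model_3d` are placeholder functions rather than Transformers.

```python
from typing import Callable, List

Slice = List[List[float]]   # H x W
Volume = List[Slice]        # D x H x W


def predict_slices(volume: Volume, model_2d: Callable[[Slice], Slice]) -> Volume:
    """Run the 2D model slice by slice and stack the outputs into a volume."""
    return [model_2d(s) for s in volume]


def fuse(pred_2d: Volume, volume: Volume):
    """Hypothetical stand-in for HLGM: pair (2D prediction, intensity) per voxel."""
    return [[list(zip(p_row, v_row)) for p_row, v_row in zip(p_sl, v_sl)]
            for p_sl, v_sl in zip(pred_2d, volume)]


def hybrid_forward(volume: Volume, model_2d, model_3d) -> Volume:
    pred_2d = predict_slices(volume, model_2d)
    residual = model_3d(fuse(pred_2d, volume))
    # Assumed residual learning: the 3D output is added to the 2D predictions.
    return [[[p + r for p, r in zip(p_row, r_row)]
             for p_row, r_row in zip(p_sl, r_sl)]
            for p_sl, r_sl in zip(pred_2d, residual)]


def toy_model_2d(s: Slice) -> Slice:
    """Placeholder 2D model: threshold each pixel independently."""
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in s]


def toy_model_3d(fused) -> Volume:
    """Placeholder 3D model: nudge each voxel toward its depth neighbours."""
    depth = len(fused)
    out = []
    for z, sl in enumerate(fused):
        nbrs = [k for k in (z - 1, z + 1) if 0 <= k < depth]
        out.append([[0.5 * (sum(fused[k][y][x][0] for k in nbrs) / len(nbrs) - p)
                     if nbrs else 0.0
                     for x, (p, _intensity) in enumerate(row)]
                    for y, row in enumerate(sl)])
    return out


# Three 1 x 1 slices; the middle slice falls below the 2D threshold.
volume = [[[0.9]], [[0.4]], [[0.8]]]
result = hybrid_forward(volume, toy_model_2d, toy_model_3d)
print(result)
```

In this toy run the 2D stage labels the middle slice as background, and the 3D stage raises its score because both neighbouring slices are foreground. That mirrors the idea of a 3D branch correcting slice-wise predictions with inter-slice context.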