AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation

Peijie Qiu, Jin Yang, Sayantan Kumar, Soumyendu Sekhar Ghosh, Aristeidis Sotiras

TL;DR

AgileFormer addresses the challenge of segmenting medical images with heterogeneous object appearances by introducing spatially dynamic components into a ViT-UNet framework. It combines deformable patch embedding, spatially dynamic self-attention (deformable and neighborhood variants), and multi-scale deformable positional encoding to capture varying shapes and sizes across targets. Across 2D and 3D experiments on Synapse, ACDC, and Decathlon brain tumor datasets, AgileFormer achieves state-of-the-art performance with favorable parameter and compute characteristics, demonstrating strong scalability. The work provides a systematic design blueprint for spatially adaptive ViT-UNets in medical image segmentation and offers open-source code for reproducibility and further research.

Abstract

In the past decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in a variety of medical image segmentation tasks. Recently, the introduction of the vision transformer (ViT) has significantly altered the landscape of deep segmentation models. There has been a growing focus on ViTs, driven by their excellent performance and scalability. However, we argue that the current design of the vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance (e.g., varying shapes and sizes) of objects of interest in medical image segmentation tasks. To tackle this challenge, we present a structured approach to introduce spatially dynamic components to the ViT-UNet. This adaptation enables the model to effectively capture features of target objects with diverse appearances. This is achieved by three main components: (i) deformable patch embedding; (ii) spatially dynamic multi-head attention; (iii) deformable positional encoding. These components were integrated into a novel architecture, termed AgileFormer. AgileFormer is a spatially agile ViT-UNet designed for medical image segmentation. Experiments in three segmentation tasks using publicly available datasets demonstrated the effectiveness of the proposed method. The code is available at https://github.com/sotiraslab/AgileFormer.
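
To make the notion of a "deformable" component concrete, the following is a minimal sketch, not the authors' implementation (see the linked repository for that). It shows only the generic idea behind deformable sampling: each point of a regular grid is shifted by an offset, and the feature map is read at the resulting fractional position by bilinear interpolation. In a deformable patch embedding or deformable attention layer the offsets would be predicted by the network; here they are supplied by hand as an assumption, and the function names `bilinear_sample` and `deformable_sample` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Read a 2D feature map (list of rows) at fractional (y, x).

    Positions outside the map contribute zero (zero padding).
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0  # fractional parts act as interpolation weights
    value = 0.0
    for dy, dx, weight in (
        (0, 0, (1 - wy) * (1 - wx)),
        (0, 1, (1 - wy) * wx),
        (1, 0, wy * (1 - wx)),
        (1, 1, wy * wx),
    ):
        yy, xx = y0 + dy, x0 + dx
        if 0 <= yy < h and 0 <= xx < w:
            value += weight * feat[yy][xx]
    return value


def deformable_sample(feat, grid, offsets):
    """Sample feat at regular grid points shifted by per-point (dy, dx) offsets."""
    return [
        bilinear_sample(feat, gy + oy, gx + ox)
        for (gy, gx), (oy, ox) in zip(grid, offsets)
    ]


feat = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
grid = [(0, 0), (1, 1)]

# Zero offsets reproduce fixed-grid sampling.
fixed = deformable_sample(feat, grid, [(0.0, 0.0), (0.0, 0.0)])
# Hand-picked offsets stand in for offsets a network would predict.
shifted = deformable_sample(feat, grid, [(0.5, 0.5), (0.0, 1.0)])
```

With zero offsets the sketch reduces to ordinary fixed-grid sampling, which is the behaviour the abstract argues is too rigid for objects of varying shape and size; non-zero offsets let the sampling locations move toward the content.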

Paper Structure (23 sections, 7 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Left: Qualitative comparison between the proposed AgileFormer and nnFormer (Zhou et al., 2021) on the Synapse multi-organ segmentation task, where each white dashed box marks inaccurately segmented regions. For demonstration purposes, we visualize all 13 organs, but we only report results for 8 of them. We can observe that nnFormer struggled to accurately segment the spleen and stomach, confusing one for the other. Moreover, it over-segmented the right kidney. This is likely because fixed-size window attention and patch embedding cannot accurately capture objects with varying sizes and shapes, and hence produce inaccurate feature representations. In contrast, AgileFormer, which can capture spatially varying representations via deformable patch embedding, spatially dynamic self-attention, and multi-scale deformable positional encoding, accurately segmented organs with varying sizes and shapes. Right: Segmentation accuracy (DSC) against model complexity (number of parameters and FLOPs) on the Synapse multi-organ segmentation task. In both 2D and 3D settings, AgileFormer outperformed recent state-of-the-art methods, while having fewer parameters and FLOPs.
  • Figure 2: A roadmap from SwinUNet to the design of AgileFormer on the Synapse dataset. From top to bottom, each row represents a model design variant, including patch embedding, self-attention, and positional encoding. The foreground bars represent DSC (%) in the FLOP (G) regime of different design variants; a hatched bar means the modification results in a performance drop.
  • Figure 3: Overview of the proposed AgileFormer. AgileFormer is a U-shaped vision transformer consisting of deformable patch embedding as well as neighborhood and deformable self-attention building blocks with a deformable positional encoding. For illustrative purposes, we take the 2D AgileFormer as an example. Please refer to Section 3.4 for the detailed discussion on extending 2D AgileFormer to 3D AgileFormer for volumetric segmentation tasks.
  • Figure 4: The proposed multi-scale deformable positional encoding (MS-DePE) for irregularly sampled grids in deformable multi-head self-attention. For illustrative purposes, we take the 2D model as an example, where 2D deformable depth-wise convolution is used. In the 3D model, the 2D deformable depth-wise convolution should be replaced with its 3D counterpart.
  • Figure 5: Comparison of model scalability on the Synapse dataset. The base model is almost four times larger than the tiny model. The proposed AgileFormer demonstrated exceptional scalability when adding parameters compared to other methods.
  • ...and 5 more figures
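
The DSC reported in the captions above is the Dice similarity coefficient, which for a predicted region A and a reference region B is 2|A ∩ B| / (|A| + |B|). The snippet below is a minimal sketch of the per-label computation on flattened label maps; it is not the evaluation code used in the paper. The function name `dice_score` and the choice to return 1.0 when a label is absent from both prediction and reference are assumptions made for illustration.

```python
def dice_score(pred, target, label):
    """Dice similarity coefficient for one label over flattened label maps."""
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Assumed convention: a label absent from both maps counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Toy label maps: 0 = background, 1 and 2 = two hypothetical organs.
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 2]

dsc_organ_1 = dice_score(pred, target, 1)  # 2 * 1 / (2 + 1)
dsc_organ_2 = dice_score(pred, target, 2)  # 2 * 3 / (3 + 4)
```

In the toy example, the over-segmented label 1 scores lower than label 2 even though only one voxel is wrong, because small structures are penalized more heavily per mislabeled voxel.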