Medical Image Segmentation Using Directional Window Attention

Daniya Najiha Abdul Kareem, Mustansar Fiaz, Noa Novershtern, Hisham Cholakkal

TL;DR

This work introduces DwinFormer, a hierarchical encoder-decoder architecture for medical image segmentation that combines directional window (Dwin) attention and global self-attention (GSA) for feature encoding. Its Dwin block captures local and global information along the horizontal, vertical, and depthwise directions of the input feature map.

Abstract

Accurate segmentation of medical images is crucial for diagnostic purposes, including cell segmentation, tumor identification, and organ localization. Traditional convolutional neural network (CNN)-based approaches struggle to achieve precise segmentation because of their limited receptive fields, particularly in multi-organ segmentation, where organs vary in shape and size. Transformer-based approaches address this limitation by leveraging a global receptive field, but they often have difficulty capturing the local information required for pixel-precise segmentation. In this work, we introduce DwinFormer, a hierarchical encoder-decoder architecture for medical image segmentation comprising directional window (Dwin) attention and global self-attention (GSA) for feature encoding. The focus of our design is the Dwin block, which captures local and global information along the horizontal, vertical, and depthwise directions of the input feature map by performing attention separately in each of these directional volumes. To this end, the Dwin block introduces a nested Dwin attention (NDA) that progressively increases the receptive field in the horizontal, vertical, and depthwise directions, and a convolutional Dwin attention (CDA) that captures local contextual information for the attention computation. While the Dwin block captures local and global dependencies at the first two high-resolution stages of DwinFormer, the GSA block encodes global dependencies at the last two lower-resolution stages. Experiments on the challenging 3D Synapse Multi-organ dataset and the Cell HMS dataset demonstrate the benefits of DwinFormer over state-of-the-art approaches. Our source code will be publicly available at https://github.com/Daniyanaj/DWINFORMER.

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: (a) Overall architecture of the proposed DwinFormer, which has a hierarchical encoder-decoder framework. In the encoder, the stem features are input to the directional window (Dwin) block, which explicitly learns local and global dependencies at high resolution in the first two stages, whereas the global self-attention (GSA) block is applied in the last two stages to capture global information. In the decoder, the features are first upsampled and then added to the encoder features through a skip connection. The focus of the design is (b) the Dwin block, which enables DwinFormer to capture local and global information in multiple directions within the input feature map. The Dwin block consists of two components: (c) nested Dwin attention (NDA), which gradually expands the receptive field in the depthwise, horizontal, and vertical directions, and (d) convolutional Dwin attention (CDA), which captures local contextual information using depthwise convolution during the attention computation. (e) The qkv computation for attention in (i) NDA and (ii) CDA. NDA employs a linear layer to obtain qkv, while CDA additionally captures local information using a depthwise convolution.
  • Figure 2: Qualitative analysis on the multi-organ Synapse dataset [landman2015miccai] shows that our method improves segmentation by accurately detecting organs with clear boundaries.
  • Figure 3: Qualitative results of DwinFormer on the Cell HMS dataset [cellseg]. Rows correspond to views from different planes. DwinFormer predicts foreground (FG), background (BG), and cell boundary regions (columns 2-4), which are post-processed to segment cell instances (column 5).
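
The idea of attending separately within horizontal, vertical, and depthwise directional volumes (Figure 1) can be illustrated with a small partitioning sketch. This is not the authors' implementation: the function name `directional_windows`, the `width` parameter, the axis naming, and the assumption that each window spans the full extent of one axis while staying `width` voxels wide along the other two are all hypothetical choices made for illustration. The sketch only groups voxel coordinates; attention would then be computed among the tokens of each group.

```python
from itertools import product


def directional_windows(shape, axis, width):
    """Partition the voxels of a volume into windows elongated along `axis`.

    shape: (D, H, W) size of the feature volume.
    axis:  0 = depthwise, 1 = vertical, 2 = horizontal (naming is an assumption).
    width: window size along the two remaining axes (hypothetical parameter).
    Returns a list of windows; each window is a list of (d, h, w) coordinates.
    """
    other = [a for a in range(3) if a != axis]
    for a in other:
        if shape[a] % width != 0:
            raise ValueError("width must divide the non-directional axes")

    windows = {}
    for coord in product(*(range(n) for n in shape)):
        # Voxels share a window when they fall in the same block along the
        # two non-directional axes; position along `axis` is unrestricted.
        key = tuple(coord[a] // width for a in other)
        windows.setdefault(key, []).append(coord)
    return list(windows.values())


if __name__ == "__main__":
    shape = (4, 4, 4)
    for axis, name in enumerate(["depthwise", "vertical", "horizontal"]):
        wins = directional_windows(shape, axis, width=2)
        print(name, len(wins), "windows of", len(wins[0]), "tokens")
```

Under these assumptions, a 4×4×4 volume with `width=2` yields 4 windows of 16 tokens in each direction, and every voxel belongs to exactly one window per direction. A nested scheme in the spirit of NDA could call the function with progressively larger `width` values, though the paper's actual window schedule is not specified in this summary.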