CATFA-Net: A Trans-Convolutional Approach for Accurate Medical Image Segmentation

Siddhartha Mallick, Aayushman Ghosh, Jayanta Paul, Jaya Sil

Abstract

Convolutional blocks have played a crucial role in advancing medical image segmentation by excelling in dense prediction tasks. However, their inability to effectively capture long-range dependencies has limited their performance. Transformer-based architectures, leveraging attention mechanisms, address this limitation by modeling global context and creating expressive feature representations. Recent research has explored this potential by introducing hybrid frameworks that combine transformer encoders with convolutional decoders. Despite their advantages, these approaches face challenges such as limited inductive bias, high computational cost, and reduced robustness to data variability. To overcome these issues, this study introduces CATFA-Net, a novel and efficient segmentation framework designed to produce high-quality segmentation masks while reducing computational costs and increasing inference speed. CATFA-Net employs a hierarchical hybrid encoder architecture with a lightweight convolutional decoder backbone. Its transformer-based encoder uses a new Context Addition Attention mechanism that captures inter-image dependencies without the quadratic complexity of standard attention mechanisms. Features from the transformer branch are fused with those from the convolutional branch through a proposed Cross-Channel Attention mechanism, which helps retain spatial and channel information during downsampling. Additionally, a Spatial Fusion Attention mechanism in the decoder refines features while reducing background noise ambiguity. Extensive evaluations on five publicly available datasets show that CATFA-Net outperforms existing methods in accuracy and efficiency. The framework sets new state-of-the-art Dice scores on GLaS (94.48%) and ISIC 2018 (91.55%). Robustness tests and external validation further demonstrate its strong ability to generalize in binary segmentation tasks.

Paper Structure (28 sections, 7 equations, 14 figures, 9 tables)

This paper contains 28 sections, 7 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Demonstrating the importance of modeling long-range dependencies. Examples from various medical imaging benchmarks (GLaS, DS Bowl 2018, REFUGE, CVC Clinic DB, ISIC 2018) are shown. Blue outlines represent the ground truth (gt), red outlines indicate U-Net predictions, and green outlines show predictions from CATFA-Net (C-Net). While the convolution-based U-Net method misclassifies regions in several datasets due to its limited ability to capture long-range dependencies, CATFA-Net effectively addresses this limitation.
  • Figure 2: Overview of the proposed CATFA-Net model for efficient medical image segmentation. This architecture integrates ConvNeXt and H-CAT encoder branches for efficient feature extraction, utilizing advanced attention mechanisms such as context-addition attention, which captures inter-image resemblance to enhance feature representation, and cross-channel attention, which maintains consistency across spatial and channel dimensions. The encoder processes the input image through several stages, progressively reducing spatial resolution while deepening feature representation. The decoder reconstructs the segmentation map using spatial fusion mechanisms to reduce background ambiguity, along with bilinear up-sampling and skip connections to preserve spatial context effectively.
  • Figure 3: Overview of the key building blocks used in the proposed method. (a) The Context Addition Transformer (CAT) block integrates a novel Context Addition Self-Attention module, LN, and a depthwise Fully Convolutional Network (d-FCN) to capture global and contextual dependencies effectively while learning inter-image relations. (b) The ConvNext block, employing a $7\times7$ convolution kernel, LN, and a pointwise $1\times1$ convolution, provides a lightweight yet powerful local feature extraction mechanism. (c) The Conv-G-Next block builds on ConvNext by incorporating Batch Normalization, a GELU activation layer, and an additional $1\times1$ convolution layer, enhancing non-linear transformations and enabling finer-grained feature representation while up-sampling.
  • Figure 4: Overview of the proposed Context Addition Self Attention block. The Context Attention Pre-attention (CAP) module enhances $\mathbf{K}$ using $\mathbf{Q}$, $1\times1$ convolutions, and a GELU nonlinearity to learn inter-image dependencies. Simultaneously, $\mathbf{V}$ and $\mathbf{K'}$ are passed through a spatial reduction block, reducing the computational complexity of the whole process from $\mathcal{O}(N^2)$ to $\mathcal{O}(N^2/\mathcal{R})$, where $\mathcal{R}$ is the reduction ratio. The modified heads are then processed through a standard MSA block to produce the output $\mathbf{z'_s}$.
  • Figure 5: Overview of the Cross Channel Trans-Convolutional Fusion Attention block. The Cross Channel Attention module merges the outputs from the CAT and ConvNext blocks ($\mathbf{t_{out}}$ and $\mathbf{c_{out}}$) through a softmax-based fusion, enhancing global information along the channel dimension. The Spatial Attention component processes the ConvNext output through two pathways ($P_1$ and $P_2$) to remove noise and encode contextual information, preserving spatial resolution and long-range dependencies. The final output is a fusion of both channel and spatial attention, ensuring effective aggregation of multi-scale feature representations across both dimensions.
  • ...and 9 more figures
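
The complexity reduction quoted in the Figure 4 caption, from $\mathcal{O}(N^2)$ to $\mathcal{O}(N^2/\mathcal{R})$, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single head, omits the CAP module and all learned projections, and substitutes average pooling for the spatial reduction block, whose exact form is not given in the caption. The names `spatial_reduce` and `reduced_attention` are hypothetical, and the sketch assumes the reduction ratio counts tokens, so pooling by `r` per side gives $\mathcal{R} = r^2$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduce(tokens, h, w, r):
    # Assumption: average pooling stands in for the spatial reduction block.
    # tokens has shape (h*w, d) in row-major order; the result has
    # shape (h*w / r**2, d), i.e. R = r**2 fewer tokens.
    d = tokens.shape[1]
    grid = tokens.reshape(h // r, r, w // r, r, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)

def reduced_attention(q, k, v, h, w, r):
    # Queries keep all N = h*w tokens; keys and values are reduced,
    # so the score matrix is N x (N/R) instead of N x N.
    k_r = spatial_reduce(k, h, w, r)
    v_r = spatial_reduce(v, h, w, r)
    scores = q @ k_r.T / np.sqrt(q.shape[1])
    return softmax(scores) @ v_r, scores.shape

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, d, r = 8, 8, 8, 2
    q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
    out, score_shape = reduced_attention(q, k, v, h, w, r)
    print("output shape:", out.shape)
    print("score matrix shape:", score_shape)
```

With `r = 1` the pooling is a no-op and the function reduces to ordinary scaled dot-product attention, which makes the saving easy to see: the output keeps one row per query token, while the score matrix shrinks by the factor $\mathcal{R}$.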