MACMD: Multi-dilated Contextual Attention and Channel Mixer Decoding for Medical Image Segmentation
Lalit Maurya, Honghai Liu, Reyer Zwiggelaar
TL;DR
The paper tackles the challenge of balancing local detail with global context in medical image segmentation, where CNNs struggle with long-range dependencies and transformers incur high computation. It introduces the MACMD decoder, a multi-module skip-connection framework that combines MCAG, APM, MSCCM, and MEAB to enable multi-scale, attention-guided feature fusion between encoders and decoders. Through extensive experiments on BUSI, ISIC 2017, and Synapse using MaxViT-T and PVT-V2-B2 encoders, MACMD achieves state-of-the-art Dice scores with substantial efficiency gains, and ablations confirm that each module contributes positively to performance. The results suggest that MACMD provides a practical, scalable solution for precise, robust medical image segmentation with improved interpretability via Grad-CAM visualizations, albeit with some limitations in boundary precision and volumetric context in 2D slices.
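Since the comparisons above are reported as Dice scores, a minimal sketch of the standard Dice coefficient for binary masks may help. The function name `dice_score` and the flat-list mask representation are illustrative assumptions, not taken from the authors' code.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # For 0/1 masks the elementwise product counts overlapping foreground pixels.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Two of the three predicted foreground pixels overlap the three target pixels.
example = dice_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # 2 * 2 / (3 + 3)
```

A score of 1 means the predicted and reference masks coincide, and 0 means they share no foreground pixels.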
Abstract
Medical image segmentation is challenging because anatomical structures vary widely. Convolutional neural networks (CNNs) capture local features effectively but struggle to model long-range dependencies. Transformers mitigate this issue with self-attention mechanisms but are less able to preserve local contextual information. State-of-the-art models primarily follow an encoder-decoder architecture and have achieved notable success. However, two key limitations remain: (1) shallow layers, which are closer to the input, capture fine-grained details, but this information is lost as data propagates through deeper layers; and (2) local details and global context are integrated inefficiently between the encoder and decoder stages. To address these challenges, we propose the MACMD-based decoder, which enhances attention mechanisms and facilitates channel mixing between encoder and decoder stages via skip connections. This design leverages hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module to capture long-range dependencies while preserving the local contextual details that are essential for precise medical image segmentation. We evaluated our approach using multiple transformer encoders on both binary and multi-organ segmentation tasks. The results demonstrate that our method outperforms state-of-the-art approaches in terms of Dice score and computational efficiency, highlighting its effectiveness in achieving accurate and robust segmentation. The code is available at https://github.com/lalitmaurya47/MACMD
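To illustrate why dilated convolutions help capture longer-range context, the following is a minimal 1D sketch in plain Python. It is not the authors' module: the function names, the all-ones kernel, the dilation rates (1, 2, 3), and the fusion by summation are all assumptions chosen for illustration.

```python
def dilated_conv1d_same(signal, kernel, dilation):
    """Zero-padded 1D cross-correlation with `dilation - 1` gaps between taps.

    Assumes an odd-length kernel so the output has the same length as the input.
    """
    half = (len(kernel) - 1) // 2
    pad = half * dilation  # padding grows with dilation to keep the length fixed
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[pos + i * dilation] for i, k in enumerate(kernel))
        for pos in range(len(signal))
    ]


def multi_dilated_sum(signal, kernel, dilations=(1, 2, 3)):
    """Fuse parallel dilated branches by elementwise summation (an assumption)."""
    branches = [dilated_conv1d_same(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]


# A single impulse shows how far each branch "sees" with the same 3-tap kernel.
impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
narrow = dilated_conv1d_same(impulse, [1, 1, 1], dilation=1)  # responds at 3, 4, 5
wide = dilated_conv1d_same(impulse, [1, 1, 1], dilation=3)    # responds at 1, 4, 7
fused = multi_dilated_sum(impulse, [1, 1, 1])
```

With the same three weights, the dilation-3 branch covers a span of seven positions instead of three, so combining several rates mixes fine local detail with wider context at no extra parameter cost per branch.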
