MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation

Anzhe Cheng, Chenzhong Yin, Yu Chang, Heng Ping, Shixuan Li, Shahin Nazarian, Paul Bogdan

TL;DR

MaskAttn-UNet adds a mask attention-driven enhancement to the U-Net framework to address low-resolution segmentation. It inserts a Mask Attention Module that performs masked multi-head self-attention at multiple scales and trains with a composite loss $L_{seg} = L_{CE} + \lambda L_{IC}$, balancing local detail with long-range context while remaining computationally efficient. Empirical results on COCO, ADE20K, and Cityscapes show competitive semantic, instance, and panoptic performance at $128\times128$ resolution with markedly lower FLOPs than transformer-based models, and strong data efficiency when training data is limited. The method offers a practical, scalable solution for resource-constrained robotics, autonomous driving, and AR tasks without sacrificing segmentation quality. Future work may extend MaskAttn-UNet to medical imaging and explore diffusion-based augmentation to further improve boundary delineation and small-object handling.
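
The composite objective $L_{seg} = L_{CE} + \lambda L_{IC}$ can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not define $L_{IC}$, so it is passed in as an already-computed scalar, and the function names `cross_entropy` and `composite_loss` are hypothetical. The default $\lambda = 0.5$ is taken from the Figure 4 caption, which reports the minimum combined loss at that value.

```python
import math

def cross_entropy(probs, target):
    """Mean pixel-wise cross-entropy.

    probs:  list of per-pixel class-probability lists (each sums to 1)
    target: list of ground-truth class indices, one per pixel
    """
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, t in zip(probs, target):
        total += -math.log(p[t] + eps)
    return total / len(target)

def composite_loss(l_ce, l_ic, lam=0.5):
    """L_seg = L_CE + lambda * L_IC.

    l_ic is a placeholder scalar (assumption): its definition is not
    given in this summary, so the caller must supply it.
    """
    return l_ce + lam * l_ic

# Two pixels, two classes; values are made up for illustration.
probs = [[0.9, 0.1], [0.2, 0.8]]
target = [0, 1]
l_ce = cross_entropy(probs, target)
l_seg = composite_loss(l_ce, l_ic=0.3, lam=0.5)
```

Setting `lam=0` recovers plain cross-entropy training, which makes the weight a convenient knob for ablating the auxiliary term.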

Abstract

Low-resolution image segmentation is crucial in real-world applications such as robotics, augmented reality, and large-scale scene understanding, where high-resolution data is often unavailable due to computational constraints. To address this challenge, we propose MaskAttn-UNet, a novel segmentation framework that enhances the traditional U-Net architecture via a mask attention mechanism. Our model selectively emphasizes important regions while suppressing irrelevant backgrounds, thereby improving segmentation accuracy in cluttered and complex scenes. Unlike conventional U-Net variants, MaskAttn-UNet effectively balances local feature extraction with broader contextual awareness, making it particularly well-suited for low-resolution inputs. We evaluate our approach on three benchmark datasets with input images rescaled to $128\times128$ and demonstrate competitive performance across semantic, instance, and panoptic segmentation tasks. Our results show that MaskAttn-UNet achieves accuracy comparable to state-of-the-art methods at significantly lower computational cost than transformer-based models, making it an efficient and scalable solution for low-resolution segmentation in resource-constrained scenarios.
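
The idea of a mask that emphasizes some regions and suppresses others inside self-attention can be sketched as follows. This is an illustration under stated assumptions, not the paper's module: it uses a single head, omits the learned query/key/value projections, and assumes the mask is added to the attention logits before the softmax. The abstract does not specify how the mask is applied, and the name `masked_self_attention` is hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def masked_self_attention(x, mask_logits):
    """Single-head self-attention with an additive mask (assumption).

    x:           n x d token features, used directly as Q, K and V
                 (learned projections omitted for brevity)
    mask_logits: n x n scores added to the attention logits; in a
                 trained module these would be learned, here they
                 are supplied by the caller
    """
    n, d = len(x), len(x[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        logits = [
            scale * sum(x[i][k] * x[j][k] for k in range(d)) + mask_logits[i][j]
            for j in range(n)
        ]
        w = softmax(logits)
        out.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out

# Three tokens; the mask suppresses attention to the third token.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[0.0, 0.0, -1e9] for _ in range(3)]
out = masked_self_attention(x, mask)
```

With the third token masked out, every output row is a convex combination of the first two tokens only, which is the "suppress irrelevant regions" behaviour in miniature.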

Paper Structure

This paper contains 22 sections, 8 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of the proposed MaskAttn-UNet. (a) Overall architecture with a U-Net encoder-decoder and skip connections. (b) Mask Attention Module applying a learnable mask to modulate self-attention. (c) Multi-scale encoder–decoder design with convolutional layers, mask attention at each scale and skip connections between encoder and decoder.
  • Figure 2: Visualization of segmentation results on (a) COCO and (b) ADE20K. For each dataset, the left two columns show semantic segmentation, and the right two columns show instance segmentation. The top row in each block is the input image, followed by the ground truth, and then predictions from different methods.
  • Figure 3: Segmentation performance of MaskAttn-UNet on different fractions (10%, 25%, 50%, 75%, 100%) of the panoptic_train2017 dataset. Results illustrate consistent improvement across metrics with increasing dataset size, highlighting the model's strong data efficiency.
  • Figure 4: Trend of the combined loss as a function of $\lambda$. The model was trained for only 20 epochs and is therefore not fully trained. The minimum loss is observed at $\lambda=0.5$.
  • Figure 5: Visualization of low-resolution segmentation results. (a) Sample semantic segmentation at $64\times64$ resolution. (b) Semantic segmentation at $64\times48$ resolution. (c) Semantic segmentation at $32\times32$ resolution.
  • ...and 2 more figures