HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

Guoan Xu, Wenjing Jia, Tao Wu, Ligeng Chen, Guangwei Gao

TL;DR

HAFormer targets lightweight semantic segmentation by combining hierarchy-aware CNN features with an Efficient Transformer, balancing local detail against global context. It introduces three components: the Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction, an Efficient Transformer (ET) that reduces the quadratic cost of self-attention, and a correlation-weighted Fusion (cwF) module that selectively merges CNN and Transformer features. The model reports 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, with frame rates of 105 FPS and 118 FPS on a single 2080Ti GPU, while keeping FLOPs low and the model compact. These results indicate a practical trade-off between accuracy and efficiency for resource-constrained urban-scene segmentation, and the framework offers a reference point for future lightweight designs that pair CNN inductive bias with Transformer global modeling.
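To make the "quadratic cost of self-attention" concrete, the sketch below counts multiply-accumulate operations (MACs) for the two matrix products in a single attention head. It is an illustration only: the function name, the feature-map size, the head dimension, and the idea of shrinking the key/value token set by a factor `r` are assumptions chosen for the example, not a description of how HAFormer's ET module is built.

```python
def attention_macs(num_queries: int, num_keys: int, head_dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attn @ V in a single head."""
    scores = num_queries * num_keys * head_dim        # Q @ K^T
    weighted_sum = num_queries * num_keys * head_dim  # softmax(scores) @ V
    return scores + weighted_sum


# Hypothetical 64 x 128 feature map, flattened to one token per position.
tokens = 64 * 128
head_dim = 64

# Standard self-attention: every token attends to every token (N x N).
full = attention_macs(tokens, tokens, head_dim)

# Assumed reduction: keep all queries, shrink keys/values by a factor r.
r = 16
reduced = attention_macs(tokens, tokens // r, head_dim)

print(f"full:    {full:,} MACs")
print(f"reduced: {reduced:,} MACs ({full // reduced}x fewer)")
```

Doubling the side length of the feature map quadruples `tokens` and multiplies `full` by 16, which is why unmodified attention becomes expensive on high-resolution urban scenes.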

Abstract

Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for improvement, particularly under constrained computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic calculations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and a compact model size, reaching 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, with frame rates of 105 FPS and 118 FPS on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.
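The accuracy figures above are mean Intersection-over-Union (mIoU) scores. As a reminder of what that metric measures, here is a minimal pure-Python sketch. The function name, the flat label lists, the `ignore_index` value of 255, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration; the source does not describe its evaluation code.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes, computed from flat sequences of class labels."""
    # conf[t][p] counts pixels whose true class is t and predicted class is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the score
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[row][c] for row in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six pixels, three classes, one ignored pixel.
target = [0, 0, 1, 1, 2, 255]
pred = [0, 1, 1, 1, 2, 0]
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. Benchmark toolkits usually accumulate the confusion matrix over the whole test set before averaging, rather than averaging per-image scores.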

Paper Structure (16 sections, 17 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Visual comparison of small object segmentation using our approach versus an existing method on sample images from Cityscapes (top) and CamVid (bottom).
  • Figure 2: The overall architecture of the proposed HAFormer. HAFormer introduces a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, HAFormer develops an efficient Transformer module to streamline the quadratic calculations. Additionally, a correlation-weighted Fusion (cwF) module selectively combines diverse feature representations, markedly boosting predictive accuracy.
  • Figure 3: The architecture of our Hierarchy-Aware Pixel-Excitation (HAPE) module. $DC$ stands for dilated convolution.
  • Figure 4: Supplementary illustration of the hierarchical receptive fields for HAPE.
  • Figure 5: The architecture of the proposed efficient Multi-Head Self-Attention (eMHSA).
  • ...and 5 more figures
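
Figures 3 and 4 concern dilated convolutions and the receptive fields they cover. As a rough aid to reading them, the sketch below applies the standard relation for a dilated kernel, k_eff = k + (k - 1)(d - 1), where k is the kernel size and d the dilation rate. The function names and the dilation rates are hypothetical placeholders, not the values or layout used in HAPE.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent (per side) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for d in dilations:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        field += effective_kernel(kernel_size, d) - 1
    return field


# Hypothetical dilation rates, chosen only for illustration.
rates = [1, 2, 5, 9]
for d in rates:
    print(f"dilation {d}: a 3x3 kernel spans {effective_kernel(3, d)} pixels per side")
print("stacked in sequence:", stacked_receptive_field(3, rates), "pixels per side")
```

A dilated kernel keeps the same number of weights as its dense counterpart while covering a wider area, which is what makes it attractive for multi-scale feature extraction under a tight parameter budget.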