Pyramid Attention Network for Semantic Segmentation

Hanchao Li, Pengfei Xiong, Jie An, Lingxue Wang

TL;DR

The paper tackles semantic segmentation by addressing the spatial detail lost through downsampling. It introduces the Pyramid Attention Network (PAN), which combines a Feature Pyramid Attention (FPA) module for multi-scale, pixel-level attention with a Global Attention Upsample (GAU) decoder that uses high-level global context to guide low-level features. The approach achieves a state-of-the-art 84.0% mIoU on PASCAL VOC 2012 without COCO pretraining and a strong 78.6% mIoU on Cityscapes without coarse annotations, outperforming several architectures with heavy decoders. PAN offers an efficient, context-aware alternative to dilated convolutions and complex decoders for high-accuracy semantic segmentation.

Abstract

A Pyramid Attention Network (PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Unlike most existing works, we combine an attention mechanism with a spatial pyramid to extract precise dense features for pixel labeling, instead of relying on complicated dilated convolutions and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module that applies a spatial pyramid attention structure to the high-level output and combines it with global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer that provides global context as guidance for low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes benchmarks, with a new record mIoU of 84.0% on PASCAL VOC 2012, without training on the COCO dataset.
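
The results above are reported as mIoU (mean intersection-over-union). As a rough illustration of the metric, the following is a minimal sketch in plain Python, not the benchmark's official evaluation code. The function name `mean_iou`, the flat-list input format, and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions made for this example; benchmark evaluation typically accumulates counts over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over two equal-length flat sequences of integer class ids.

    Hypothetical helper for illustration only.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # assumption: a class absent from both is skipped
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four pixels and two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.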

Paper Structure

This paper contains 13 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Visualization results on the VOC dataset (Everingham et al., 2010). The FCN baseline model has difficulty making predictions on small object parts and fine details: in the first row the bicycle handle is missing, and in the second row the animal is assigned to the wrong category. Our Feature Pyramid Attention (FPA) module and Global Attention Upsample (GAU) module are designed to increase the receptive field and recover pixel localization details effectively.
  • Figure 2: Overview of the Pyramid Attention Network. We use ResNet-101 to extract dense features. Then we perform FPA and GAU to extract precise pixel prediction and localization details. The blue and red lines represent the downsample and upsample operators respectively.
  • Figure 3: Feature Pyramid Attention module structure. (a) Spatial Pyramid Pooling structure. (b) Feature Pyramid Attention module. '$4\times4$, $8\times8$, $16\times16$, $32\times32$' denote the resolutions of the feature maps. The dotted box marks the global pooling branch. The blue and red lines represent the downsample and upsample operators respectively. Note that all convolution layers are followed by batch normalization.
  • Figure 4: Global Attention Upsample module structure
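
The idea behind the Global Attention Upsample module, using global context from high-level features to guide low-level features, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `global_attention_upsample`, the `weight` matrix standing in for a 1×1 convolution, the ReLU gate, and nearest-neighbour upsampling are all choices made for this example, and the convolution on the low-level path and batch normalization are omitted.

```python
import numpy as np


def global_attention_upsample(low, high, weight):
    """Gate low-level features with global context from high-level features.

    low:    (C, H, W) low-level feature map
    high:   (C, h, w) high-level feature map, with H % h == 0 and W % w == 0
    weight: (C, C) matrix standing in for a 1x1 convolution (assumption)
    """
    _, H, W = low.shape
    _, h, w = high.shape
    # Global average pooling reduces each high-level channel to one value.
    context = high.mean(axis=(1, 2))                  # shape (C,)
    # A 1x1 convolution on a 1x1 map is a matrix product; ReLU is assumed.
    gate = np.maximum(weight @ context, 0.0)          # shape (C,)
    # Channel-wise reweighting of the low-level features.
    weighted_low = low * gate[:, None, None]
    # Nearest-neighbour upsampling of the high-level map (assumption).
    upsampled = high.repeat(H // h, axis=1).repeat(W // w, axis=2)
    return weighted_low + upsampled


low = np.ones((2, 4, 4))
high = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 3.0)])
out = global_attention_upsample(low, high, np.eye(2))
```

With an identity `weight`, each low-level channel is simply scaled by the mean of the matching high-level channel before the upsampled high-level map is added, so `out` has shape `(2, 4, 4)`.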