Feature boosting with efficient attention for scene parsing

Vivek Singh, Shailza Sharma, Fabio Cuzzolin

TL;DR

The paper addresses semantic scene parsing in open environments with many classes by modeling spatial context across multi-scale feature representations. It introduces FBNet, which fuses multi-level backbone features with a lightweight Spatial Attention Module (SAM) and a Channel Attention Module (CAM), and uses an auxiliary task to guide SAM with coarse global structure. CAM reweights per-pixel channel information while SAM captures broad spatial relationships efficiently, and both operate before the final per-pixel classifier. Empirical results on ADE20K and Cityscapes show state-of-the-art performance with a smaller parameter count, validating the approach and its potential for efficient, context-aware scene parsing.

Abstract

The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relations between scene elements while still identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes attention weights for each level of representation to generate the final class labels. A novel 'channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps the model learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
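
The data flow described in the abstract (upsample a low-resolution spatial attention map, concatenate it with multi-level features, then reweight channels) can be illustrated with a toy sketch. This is not the authors' FBNet code: the function names, the nearest-neighbour upsampling, and the per-pixel sigmoid gate `sigmoid(W f + b)` are all assumptions chosen for illustration, and the abstract does not specify how the channel attention module computes its weights.

```python
import math


def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D grid (list of rows). Assumed choice."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def concat_channels(maps):
    """Stack several H x W maps into an H x W grid of per-pixel channel vectors."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[[m[y][x] for m in maps] for x in range(w)] for y in range(h)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def boost(features, weights, bias):
    """Hypothetical gate: at each pixel, out = sigmoid(W f + b) * f (element-wise)."""
    boosted = []
    for row in features:
        out_row = []
        for f in row:
            gates = [sigmoid(sum(w * v for w, v in zip(w_row, f)) + b)
                     for w_row, b in zip(weights, bias)]
            out_row.append([g * v for g, v in zip(gates, f)])
        boosted.append(out_row)
    return boosted


# Toy inputs: two hypothetical backbone levels at 2x2 and a 1x1 spatial attention map.
level_a = [[1.0, 2.0], [3.0, 4.0]]
level_b = [[0.5, 0.5], [0.5, 0.5]]
spatial_lowres = [[2.0]]

spatial = upsample_nearest(spatial_lowres, 2, 2)
features = concat_channels([level_a, level_b, spatial])

# Zero weights and bias make every gate sigmoid(0) = 0.5, so each channel is halved.
weights = [[0.0] * 3 for _ in range(3)]
bias = [0.0] * 3
boosted = boost(features, weights, bias)
```

With learned weights the gates would differ per channel and per pixel, which is the sense in which some features are boosted and others attenuated. The zero-weight setting here only makes the arithmetic easy to check by hand.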

Paper Structure (12 sections, 5 equations, 6 figures, 4 tables)

This paper contains 12 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Sample images from the ADE20K dataset [zhou2017scene], reflecting the complexity of unrestricted natural scenes.
  • Figure 2: Complete architecture of the proposed Feature Boosting Network (FBNet).
  • Figure 3: The proposed Channel Attention Module (CAM) used in FBNet.
  • Figure 4: Plot of (a) mIoU versus number of parameters for the different backbones in Table \ref{tab:ablation2}; (b) mIoU versus number of parameters for all the models in Table \ref{tab:resnestCity}.
  • Figure 5: Attention maps for FBNet trained on the ADE20K dataset. The first and second rows show attention maps for SAM (size: 64×86) and CAM (size: 128×171), respectively.
  • ...and 1 more figures