Feature boosting with efficient attention for scene parsing
Vivek Singh, Shailza Sharma, Fabio Cuzzolin
TL;DR
The paper addresses semantic scene parsing in open environments with many classes by modeling spatial context across multi-scale feature representations. It introduces FBNet, which fuses multi-level backbone features with a lightweight Spatial Attention Module (SAM) and a Channel Attention Module (CAM), and uses an auxiliary task to guide SAM with coarse global structure. CAM reweights per-pixel channel information while SAM captures broad spatial relationships efficiently, and both operate before the final per-pixel classifier. Empirical results on ADE20K and Cityscapes show state-of-the-art performance with a smaller parameter count, validating the approach and its potential for efficient, context-aware scene parsing.
Abstract
The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relations between scene elements while still identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes attention weights for each level of representation to generate the final class labels. A novel "channel attention module" is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. The spatial attention features are subsequently concatenated into the final feature set before feature boosting is applied. The low-resolution spatial attention features are trained using an auxiliary task that helps the model learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
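As an illustration of the feature-boosting idea, the sketch below applies a per-channel gate to an already concatenated feature set, so that some channels are boosted and others attenuated. The abstract does not specify the internal form of the channel attention module, so the squeeze-and-excitation-style gate used here (global average pooling, one linear layer, sigmoid) is an assumption, and the names `global_average_pool`, `channel_attention`, `weights`, and `bias` are hypothetical. It is a minimal pure-Python sketch, not the authors' implementation.

```python
import math


def global_average_pool(feature_map):
    """Average a C x H x W nested list over H and W, giving one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, weights, bias):
    """Assumed gate: pool each channel, apply a linear layer, squash, rescale.

    feature_map: C x H x W nested list (the concatenated features).
    weights:     C x C nested list (hypothetical learned parameters).
    bias:        list of C values (hypothetical learned parameters).
    Returns the rescaled feature map and the per-channel gates in (0, 1).
    """
    pooled = global_average_pool(feature_map)
    gates = [
        sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
        for w_row, b in zip(weights, bias)
    ]
    boosted = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
    return boosted, gates


# Toy input: three channels of size 2 x 2, standing in for features
# concatenated from different extraction stages.
features = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
boosted, gates = channel_attention(features, identity, [0.0, 0.0, 0.0])
```

With identity weights and zero bias, each gate depends only on the mean activation of its own channel, so channels with stronger responses are attenuated less. In a trained network the weights would be learned, which lets the gate for one channel depend on the pooled responses of all the others.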
