A Deep Learning-based Global and Segmentation-based Semantic Feature Fusion Approach for Indoor Scene Classification
Ricardo Pereira, Tiago Barros, Luis Garrote, Ana Lopes, Urbano J. Nunes
TL;DR
The paper tackles indoor scene classification by introducing segmentation-based semantic features (SSFs) that capture the 2D spatial distribution of segmentation-categories from semantic masks. It proposes GS$^2$F$^2$App, a two-branch network that fuses CNN-based global features extracted from RGB images with features extracted from the SSFs by a second CNN branch (SSFs-CNN), yielding a richer scene representation. Evaluations on SUN RGB-D and NYUv2 show state-of-the-art accuracy (62.3% and 77.8%, respectively). Ablations demonstrate the effectiveness of SSFs, especially when they are processed with convolutional layers, and show that the approach is robust to the choice of backbone. The approach offers a computationally efficient path to improved indoor scene understanding and can be extended by combining segmentation-based cues with semantics derived from object detection for even richer representations.
Abstract
This work proposes a novel approach that uses a semantic segmentation mask to obtain a 2D spatial layout of the segmentation-categories across the scene, designated as segmentation-based semantic features (SSFs). These features represent, per segmentation-category, the pixel count as well as the 2D average position and the respective standard deviation values. Moreover, a two-branch network, GS$^2$F$^2$App, is proposed that exploits CNN-based global features extracted from RGB images together with segmentation-based features extracted from the proposed SSFs. GS$^2$F$^2$App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both.
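
To make the SSF definition concrete, the following is a minimal NumPy sketch of how the per-category statistics named in the abstract (pixel count, 2D average position, standard deviation) could be computed from a segmentation mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `compute_ssfs`, the five-column output layout, the use of raw pixel coordinates without normalization, and the choice to leave absent categories as zeros are all hypothetical.

```python
import numpy as np


def compute_ssfs(mask, num_categories):
    """Compute per-category statistics from a 2D integer segmentation mask.

    Returns an array of shape (num_categories, 5) whose columns are:
    pixel count, mean row, mean column, std of rows, std of columns.
    Categories absent from the mask keep an all-zero row (an assumption).
    """
    # Row and column index of every pixel, each with the same shape as mask.
    rows, cols = np.indices(mask.shape)
    ssfs = np.zeros((num_categories, 5), dtype=np.float64)

    for category in range(num_categories):
        selected = mask == category
        count = int(selected.sum())
        if count == 0:
            continue  # category not present in this scene
        r = rows[selected]
        c = cols[selected]
        ssfs[category] = (count, r.mean(), c.mean(), r.std(), c.std())

    return ssfs


# Toy 3x3 mask with three categories present and a fourth (id 3) absent.
toy_mask = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 1],
])
toy_ssfs = compute_ssfs(toy_mask, num_categories=4)
print(toy_ssfs)
```

In the toy mask, category 0 occupies the top-left 2x2 block, so its row holds a count of 4 with mean position (0.5, 0.5). How such a feature array is arranged before being passed to the SSFs branch of the network is not specified in this summary, so the sketch stops at the statistics themselves.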
