A Deep Learning-based Global and Segmentation-based Semantic Feature Fusion Approach for Indoor Scene Classification

Ricardo Pereira, Tiago Barros, Luis Garrote, Ana Lopes, Urbano J. Nunes

TL;DR

The paper tackles indoor scene classification by introducing segmentation-based semantic features (SSFs) that capture the 2D spatial distribution of segmentation-categories from semantic masks. It proposes GS$^2$F$^2$App, a two-branch CNN that fuses CNN-based global RGB features with SSFs-CNN features derived from SSFs, enabling a rich representation of scenes. Evaluations on SUN RGB-D and NYUv2 show state-of-the-art accuracy (62.3% and 77.8%, respectively), with ablations demonstrating the effectiveness of SSFs, especially when processed with convolutional layers, and the robustness to backbone choice. The approach offers a computationally efficient path to improved indoor scene understanding and can be extended by integrating segmentation-based cues with object-detection–derived semantics for even richer representations.

Abstract

This work proposes a novel approach that uses a semantic segmentation mask to obtain a 2D spatial layout of the segmentation-categories across the scene, referred to as segmentation-based semantic features (SSFs). These features represent, per segmentation-category, the pixel count, as well as the 2D average position and respective standard deviation values. A two-branch network, GS$^2$F$^2$App, is also proposed; it exploits CNN-based global features extracted from RGB images together with segmentation-based features extracted from the proposed SSFs. GS$^2$F$^2$App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both.
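
The SSFs described above can be illustrated with a minimal NumPy sketch. It assumes the segmentation mask is a 2D integer array of category indices and that coordinates are kept in raw pixel units; the function name `compute_ssfs`, the output layout, and the zero-filling of absent categories are illustrative choices, not details taken from the paper.

```python
import numpy as np

def compute_ssfs(mask: np.ndarray, num_categories: int) -> np.ndarray:
    """Per-category features from a segmentation mask (illustrative sketch).

    Returns an array of shape (num_categories, 5) with columns:
    pixel count, mean x, mean y, std x, std y.
    Categories absent from the mask are left as zeros (an assumption).
    """
    height, width = mask.shape
    # Row (y) and column (x) coordinate of every pixel.
    ys, xs = np.indices((height, width))
    ssfs = np.zeros((num_categories, 5), dtype=np.float64)
    for category in range(num_categories):
        selected = mask == category
        count = int(selected.sum())
        if count == 0:
            continue
        x = xs[selected]
        y = ys[selected]
        ssfs[category] = (count, x.mean(), y.mean(), x.std(), y.std())
    return ssfs

# Toy 4x4 mask: category 1 fills the top-left 2x2 block, category 0 the rest.
toy_mask = np.zeros((4, 4), dtype=int)
toy_mask[:2, :2] = 1
toy_ssfs = compute_ssfs(toy_mask, num_categories=3)
```

In this toy case, category 1 covers four pixels centered at (0.5, 0.5) with a standard deviation of 0.5 along each axis, while category 2 never appears and stays all zeros.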

Paper Structure (12 sections, 13 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of the proposed GS$^2$F$^2$App. The global branch (green) uses a state-of-the-art CNN to extract global features. The semantic branch (blue) generates a semantic segmentation mask, extracts SSFs from it, and passes them to the SSFs-CNN. Both branches feed a feature fusion module (yellow), which produces the scene class prediction.
  • Figure 2: Visual representation of the proposed SSFs: a) RGB image with overlaid semantic segmentation mask; b) RGB image overlaid with a 2D spatial representation of four segmentation-categories ("sofa" (n=1), "painting" (n=2), "curtain" (n=3), "table" (n=4)) derived from the segmentation mask. Each category is drawn as a region centered at its 2D average position, with an extent given by the 2D standard deviation of its pixel positions; c) the resulting SSF values. The RGB and segmentation mask images were taken from the NYU Depth Dataset V2 [nyu_dataset].
  • Figure 3: Network architectures used in the ablation study. The RGB+SegMask network independently extracts DL features from an RGB image and a semantic segmentation mask that are concatenated for scene prediction. The SSFs-NN network uses fully-connected layers to exploit correlations from the proposed SSFs. The RGB+PC network extracts CNN-based features from an RGB image and uses 1D convolutional layers to exploit the SSFs-based PC features. Both extracted features are concatenated for scene prediction.
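
The captions above describe two feature streams that are combined before scene prediction. A minimal NumPy sketch of that pattern follows, assuming plain concatenation followed by a single linear layer and a softmax. The fusion module of GS$^2$F$^2$App is not specified in this summary, so the function name `fuse_and_classify`, the feature sizes, and the random weights are purely illustrative.

```python
import numpy as np

def fuse_and_classify(global_feat, ssf_feat, weights, bias):
    """Concatenate two feature vectors and return class probabilities.

    Concatenation plus one linear layer is an assumed stand-in for the
    fusion module, not the architecture used in the paper.
    """
    fused = np.concatenate([global_feat, ssf_feat])
    logits = weights @ fused + bias
    # Numerically stable softmax.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
global_feat = rng.normal(size=8)    # stand-in for CNN-based global RGB features
ssf_feat = rng.normal(size=4)       # stand-in for SSFs-CNN features
weights = rng.normal(size=(3, 12))  # 3 hypothetical scene classes
bias = np.zeros(3)

probs = fuse_and_classify(global_feat, ssf_feat, weights, bias)
predicted_class = int(np.argmax(probs))
```

In a trained network both feature vectors would come from learned branches and the weights from optimization; the sketch only shows how the two streams meet.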