SRRM: Semantic Region Relation Model for Indoor Scene Recognition

Chuanxin Song, Xin Ma

TL;DR

Indoor scene recognition is hampered by complex co-occurring objects and semantic ambiguity. The authors propose SRRM, which models semantic region relations directly from semantic segmentation score tensors and uses adaptive confidence filtering to mitigate segmentation errors. They further introduce CSRRM, a two-stream architecture that fuses SRRM with a PlacesCNN RGB branch via a Global Integration Module based on depth-wise convolution to leverage complementary information. Across MIT-67, Places365-7/14, and SUN RGB-D, CSRRM achieves state-of-the-art performance without retraining. The authors also highlight improved interpretability and efficiency and provide a public implementation.

Abstract

Despite the remarkable success of convolutional neural networks in various computer vision tasks, recognizing indoor scenes still presents a significant challenge due to their complex composition. Consequently, effectively leveraging semantic information in the scene has been a key issue in advancing indoor scene recognition. Unfortunately, the accuracy of semantic segmentation has limited the effectiveness of existing approaches for leveraging semantic information. As a result, many of these approaches remain at the stage of auxiliary labeling or co-occurrence statistics, with few exploring the contextual relationships between the semantic elements directly within the scene. In this paper, we propose the Semantic Region Relation Model (SRRM), which starts directly from the semantic information inside the scene. Specifically, SRRM adopts an adaptive and efficient approach to mitigate the negative impact of semantic ambiguity and then models the semantic region relationship to perform scene recognition. Additionally, to more comprehensively exploit the information contained in the scene, we combine the proposed SRRM with the PlacesCNN module to create the Combined Semantic Region Relation Model (CSRRM), and propose a novel information combining approach to effectively explore the complementary contents between them. CSRRM significantly outperforms the SOTA methods on MIT Indoor 67, the reduced Places365 dataset, and SUN RGB-D without retraining. The code is available at: https://github.com/ChuanxinSong/SRRM
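
To make the idea of filtering a semantic segmentation score tensor concrete, the following is a minimal sketch and not the authors' implementation. It assumes the score tensor is a list of per-class channels holding non-negative scores, and it uses a hypothetical rule: within each channel, scores below a fraction (`ratio`) of that channel's peak are set to zero. The function names, the `ratio` parameter, and the thresholding rule itself are all assumptions made for illustration; the abstract does not specify how the adaptive filtering is computed.

```python
from typing import List

# One channel = a 2D grid of scores for a single semantic class.
Channel = List[List[float]]


def filter_channel(channel: Channel, ratio: float = 0.5) -> Channel:
    """Zero out low-confidence scores in one channel.

    Hypothetical rule (an assumption, not taken from the paper):
    the threshold adapts to each channel as ratio * channel peak.
    Scores are assumed to be non-negative.
    """
    peak = max(max(row) for row in channel)
    threshold = ratio * peak
    return [[s if s >= threshold else 0.0 for s in row] for row in channel]


def filter_score_tensor(scores: List[Channel], ratio: float = 0.5) -> List[Channel]:
    """Apply the per-channel filter independently to every class channel."""
    return [filter_channel(channel, ratio) for channel in scores]


# Toy 2-class, 2x2 score tensor (values are made up for illustration).
scores = [
    [[0.9, 0.2], [0.8, 0.1]],   # channel 0: peak 0.9, threshold 0.45
    [[0.05, 0.4], [0.1, 0.6]],  # channel 1: peak 0.6, threshold 0.3
]
filtered = filter_score_tensor(scores)
print(filtered)
```

Because the threshold is derived from each channel's own peak, a class that is only weakly present is still filtered relative to its own scores rather than against a single global cutoff. Whether SRRM uses a rule of this form is not stated in the abstract.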

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Some examples of different datasets (indoor scene, object, outdoor scene).
  • Figure 2: Semantic Region Relation Model (SRRM). The part enclosed by the red dashed box is the confidence filtering stage, which addresses semantic segmentation errors.
  • Figure 3: The adaptive confidence filtering process in a single channel.
  • Figure 4: ResBlock + ChAM.
  • Figure 5: The proposed Combined Semantic Region Relation Model (CSRRM) contains two streams. The stream with the red arrow is the proposed SRRM, which uses the semantic segmentation score tensor for scene recognition; the stream with the black arrow is the PlacesCNN module, which uses the raw RGB image for scene recognition.
  • ...and 2 more figures