Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition

Chuanxin Song, Hanbo Wu, Xin Ma

TL;DR

SpaCoNet addresses indoor scene recognition by jointly modeling spatial relations and object co-occurrence under the guidance of semantic segmentation. It comprises four parts: a Semantic Spatial Relation Module (SSRM) that extracts spatial features decoupled from the scene image, an Image Feature Extraction Module (IFEM) that supplies deep image features, a Semantic Node Feature Aggregation module that assigns both kinds of features to individual objects, and a Global-Local Dependency Module (GLDM) that captures long-range object co-occurrence via attention. The approach reports gains on MIT-67, SUN397, Places, and SUN RGB-D, with ablations supporting the Adaptive Confidence Filter (ACF), the feature aggregation strategy, and the encoder-decoder design of the GLDM. The framework is presented as a scalable and efficient route to indoor scene understanding, with potential to generalize further.

Abstract

Exploring the semantic context in scene images is essential for indoor scene recognition. However, due to the diverse intra-class spatial layouts and the coexisting inter-class objects, modeling contextual relationships that adapt to various image characteristics is a great challenge. Existing contextual modeling methods for scene recognition exhibit two limitations: 1) They typically model only one type of spatial relationship (order or metric) among objects within scenes, with limited exploration of diverse spatial layouts. 2) They often overlook the differences in coexisting objects across different scenes, suppressing scene recognition performance. To overcome these limitations, we propose SpaCoNet, which simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation. Firstly, the Semantic Spatial Relation Module (SSRM) is constructed to model scene spatial features. With the help of semantic segmentation, this module decouples spatial information from the scene image and thoroughly explores all spatial relationships among objects in an implicit manner, thereby obtaining semantic-based spatial features. Secondly, both the spatial features from the SSRM and the deep features from the Image Feature Extraction Module are allocated to each object, so as to distinguish coexisting objects across different scenes. Finally, utilizing the discriminative features above, we design a Global-Local Dependency Module to explore the long-range co-occurrence among objects and to generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
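
To make the "features are allocated to each object" step concrete, here is a minimal sketch of one way to do it: average the feature vectors of all pixels that share a semantic label. Masked average pooling is an assumption chosen for illustration, since the abstract does not specify the aggregation rule. The function name `aggregate_node_features` and the toy inputs are hypothetical, and plain Python lists stand in for tensors.

```python
from collections import defaultdict


def aggregate_node_features(feature_map, label_map):
    """Average the feature vectors of all pixels that share a semantic label.

    feature_map: H x W grid of C-dimensional feature vectors (nested lists).
    label_map:   H x W grid of integer semantic labels.
    Returns {label: mean feature vector} for every label present in label_map.

    Assumption: masked average pooling; the paper may aggregate differently.
    """
    sums = {}
    counts = defaultdict(int)
    for feat_row, label_row in zip(feature_map, label_map):
        for feat, label in zip(feat_row, label_row):
            if label not in sums:
                sums[label] = [0.0] * len(feat)
            # Accumulate this pixel's feature into the running sum for its label.
            sums[label] = [s + f for s, f in zip(sums[label], feat)]
            counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy 2 x 2 example: label 1 could stand for "bed", label 2 for "table".
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[0.0, 4.0], [2.0, 4.0]]]
labels = [[1, 1],
          [2, 2]]
nodes = aggregate_node_features(features, labels)
```

Applying the same routine once to the spatial features and once to the deep image features would yield two per-object feature sequences, one entry per semantic label.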

Paper Structure

This paper contains 27 sections, 11 equations, 10 figures, 8 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Some examples of different scene datasets. Images in the bedroom and hospital room categories could easily be confused due to shared objects like beds. This inter-class similarity, caused by coexisting objects across scenes, could also be found in the classroom and restaurant (e.g., tables). In contrast, the simple composition makes the variation between the different outdoor scene categories quite obvious.
  • Figure 2: An illustration of three spatial relationships (Metric, Order, Topological) within scenes.
  • Figure 3: Pipeline of the proposed SpaCoNet for indoor scene representation. Initially, the Semantic Spatial Relation Module provides the spatial feature $F_S$, while the Image Feature Extraction Module provides the deep feature $F_I$. These features, along with the semantic segmentation label map $L$, are then sent to the Semantic Node Feature Aggregation Module, which aggregates $F_S$ and $F_I$ under the guidance of $L$, resulting in two semantic feature sequences. Subsequently, the Global-Local Dependency Module explores the long-range co-occurrence among semantic features through the attention mechanism and modifies the global features with the information obtained. Finally, the modified features are fed into the classifier to predict the scene category.
  • Figure 4: Semantic Spatial Relation Module (SSRM). The part surrounded by the red dashed box represents the confidence filtering stage, which is used to address semantic ambiguities. The part surrounded by the purple dashed box represents the spatial context modeling stage, which is used to implicitly model spatial features from the provided spatial information.
  • Figure 5: An example of the proposed Adaptive Confidence Filter (ACF).
  • ...and 5 more figures
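
The Figure 3 caption describes the Global-Local Dependency Module as using attention over semantic features to modify the global features. The sketch below illustrates that idea with single-head scaled dot-product attention, where the global feature is the only query and the semantic node features serve as keys and values. It is a simplified illustration, not the paper's module: it has no learned projections, the residual addition is an assumption, and the name `attend_global_to_nodes` is hypothetical.

```python
import math


def attend_global_to_nodes(global_feat, node_feats):
    """Scaled dot-product attention with the global feature as the single query.

    global_feat: C-dimensional list (the scene-level feature).
    node_feats:  list of C-dimensional lists (one per semantic node).
    Returns (updated global feature, attention weights over the nodes).

    Assumptions: no learned projections, one head, residual update.
    """
    scale = math.sqrt(len(global_feat))
    scores = [
        sum(g * n for g, n in zip(global_feat, node)) / scale
        for node in node_feats
    ]
    # Numerically stable softmax over the nodes.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of node features.
    context = [
        sum(w * node[c] for w, node in zip(weights, node_feats))
        for c in range(len(global_feat))
    ]
    # Residual update: add the attended context back to the global feature.
    updated = [g + c for g, c in zip(global_feat, context)]
    return updated, weights


global_feat = [1.0, 0.0]
node_feats = [[2.0, 0.0], [0.0, 2.0]]
updated, weights = attend_global_to_nodes(global_feat, node_feats)
```

In this toy input the first node is aligned with the global feature, so it receives the larger attention weight and contributes more to the update.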