Semantic-embedded Similarity Prototype for Scene Recognition

Chuanxin Song, Hanbo Wu, Xin Ma, Yibin Li

TL;DR

This work tackles two problems in scene recognition: high inter-class similarity and the heavy computational cost of object-centric cues. It introduces a semantic-embedded similarity prototype that builds class-level semantic representations from pixel-level segmentation and derives inter-class correlations via cosine and Euclidean-distance transforms, forming a prior matrix $S \in \mathbb{R}^{C \times C}$. The prototype is applied through two plug-and-play training strategies: Gradient Label Softening (GLS), which softens labels using $S$ with a progressive confidence schedule $\sigma'$, and Batch-level Contrastive Loss (BCL), which uses $S$ to shape inter- and intra-class constraints within mini-batches. On MIT-67, SUN397, Places365-7, and Places365-14, GLS and BCL improve accuracy across multiple backbones without increasing inference cost, and they also improve performance when integrated into DGN-Net, underscoring the method’s practicality for edge devices and real-world deployment.

Abstract

Because complex composition and co-occurring objects make many scene classes highly similar, numerous studies have explored object semantic knowledge within scenes to improve scene recognition. However, object information extraction is computationally expensive and places a considerable burden on the network. This limitation often renders object-assisted approaches incompatible with edge devices in practical deployment. In contrast, this paper proposes a semantic knowledge-based similarity prototype, which helps a scene recognition network achieve higher accuracy without increasing the computational cost at deployment. It is simple and can be plugged into existing pipelines. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are used to explore correlations between scene classes, ultimately constructing a similarity prototype. Furthermore, we propose to leverage the similarity prototype to support network training through two strategies: Gradient Label Softening and Batch-level Contrastive Loss. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks while avoiding any additional computational burden in practical deployments. Code and the statistical similarity prototype will be available at https://github.com/ChuanxinSong/SimilarityPrototype
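
The abstract describes the prototype only at a high level, so the following is a minimal sketch of one way such a cosine-based similarity matrix could be assembled, not the authors' implementation. It assumes each training image has already been reduced to the set of object categories found by a segmentation model, and that a class-level representation is simply the per-class frequency of each object. The function names, the toy data, and the presence-frequency statistic are all assumptions made for illustration.

```python
import math

def class_level_representations(samples, num_classes, num_objects):
    """Per-class object frequency: fraction of images of a class containing each object.

    `samples` is an iterable of (scene_label, object_ids) pairs (hypothetical format).
    """
    sums = [[0.0] * num_objects for _ in range(num_classes)]
    counts = [0] * num_classes
    for label, object_ids in samples:
        counts[label] += 1
        for obj in set(object_ids):  # count each object once per image
            sums[label][obj] += 1.0
    return [
        [v / counts[c] if counts[c] else 0.0 for v in sums[c]]
        for c in range(num_classes)
    ]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_prototype(reps):
    """C x C matrix of pairwise cosine similarities with the diagonal fixed to 1."""
    num_classes = len(reps)
    S = [
        [cosine_similarity(reps[i], reps[j]) for j in range(num_classes)]
        for i in range(num_classes)
    ]
    for i in range(num_classes):
        S[i][i] = 1.0
    return S

# Toy data (assumed): classes 0=auditorium, 1=concert_hall, 2=bedroom;
# objects 0=seat, 1=stage, 2=bed, 3=lamp.
samples = [
    (0, [0, 1]), (0, [0, 1, 3]),
    (1, [0, 1]), (1, [0]),
    (2, [2, 3]), (2, [2]),
]
reps = class_level_representations(samples, num_classes=3, num_objects=4)
S = similarity_prototype(reps)
```

In this toy setup the auditorium and concert-hall rows share most of their objects, so their similarity is much higher than either class's similarity to the bedroom. Because the matrix is computed once from training statistics, using it during training adds nothing to inference cost.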

Paper Structure (28 sections, 17 equations, 8 figures, 9 tables, 2 algorithms)

This paper contains 28 sections, 17 equations, 8 figures, 9 tables, 2 algorithms.

Figures (8)

  • Figure 1: Similarity due to object co-occurrence in scene recognition. The rightmost column shows the occurrence probabilities of the five most frequent objects in each scene. "Auditorium" and "Concert_hall" are extremely similar, while "Bedroom" differs from both.
  • Figure 2: An illustration of the overall process of building a similarity prototype for a scene dataset. Different scene categories are represented by different colored blocks. The derived similarity prototype is a square matrix whose side length equals the number of scene categories, with all diagonal elements equal to 1. $S_{i,j}$ denotes the label correlation between scene classes $i$ and $j$.
  • Figure 3: An example of the cosine-based similarity prototype for the Places365-14 dataset. Inter-class label correlations are quantified in the similarity prototype. Darker colors indicate stronger similarity between corresponding scene categories; lighter colors indicate weaker similarity between scene categories.
  • Figure 4: How $S_{norm}^1$ changes as the epoch increases, taking the cosine-based similarity prototype for Places365-7 as an example. As the target-category confidence increases, the attention given to non-target categories gradually decreases. Throughout this process, $S_{norm}^1$ preserves the differences in attention among the non-target categories. Once $STEP$ epochs have passed, all class labels are converted to hard labels.
  • Figure 5: An illustration of the operation of the proposed Contrastive Loss function guided by the proposed Similarity Prototype. Different scene categories are represented by different colored blocks. $S_{i,j}$ denotes the label correlation between two scene classes.
  • ...and 3 more figures
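
The behaviour described in the Figure 4 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's formula: the linear ramp of the target confidence, the `start_conf` parameter, and the function name `softened_label` are all hypothetical, and the similarity values are assumed to be non-negative. The sketch keeps the three properties the caption states: target confidence grows with the epoch, non-target mass stays proportional to the similarities in $S$, and labels become hard once `step` epochs have passed.

```python
def softened_label(S, target, epoch, step, start_conf=0.5):
    """Soft label for class `target` at a given epoch.

    S          -- C x C similarity prototype with non-negative entries
    step       -- epoch from which hard labels are used (STEP in the caption)
    start_conf -- assumed target confidence at epoch 0 (hypothetical parameter)
    """
    num_classes = len(S)
    hard = [1.0 if j == target else 0.0 for j in range(num_classes)]
    if epoch >= step:
        return hard

    # Assumed schedule: confidence rises linearly from start_conf to 1.
    conf = start_conf + (1.0 - start_conf) * epoch / step

    # Non-target weights come from the target's row of S.
    others = [S[target][j] if j != target else 0.0 for j in range(num_classes)]
    total = sum(others)
    if total == 0.0:
        return hard  # no similar class to share probability mass with

    return [
        conf if j == target else (1.0 - conf) * others[j] / total
        for j in range(num_classes)
    ]

# Toy prototype (assumed values): class 0 is close to class 1, far from class 2.
S_toy = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]
early = softened_label(S_toy, target=0, epoch=0, step=10)
middle = softened_label(S_toy, target=0, epoch=5, step=10)
late = softened_label(S_toy, target=0, epoch=10, step=10)
```

With these toy values the similar class always receives more probability mass than the dissimilar one, and that mass shrinks as training proceeds. The resulting vector would replace the one-hot target in a standard cross-entropy loss, so the change affects training only.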