Dual Relation Mining Network for Zero-Shot Learning

Jinwei Han, Yingguo Gao, Zhiwen Lin, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia

TL;DR

This work tackles zero-shot learning (ZSL) by addressing both visual-semantic alignment and the underexplored semantic relationships among attributes. It introduces the Dual Relation Mining Network (DRMN), which has three components: a Dual Attention Block (DAB) that enriches visual features through multi-level fusion and pairs region-attribute spatial attention with attribute-guided channel attention; a Semantic Interaction Transformer (SIT) that models relationships among attributes; and a global classification branch that captures cues beyond the human-defined attributes. Predicted attributes are projected onto a hyperspherical space for classification, and the attribute-based and global predictions are then combined for generalized ZSL (GZSL). Experiments on CUB, SUN, and AwA2 report new state-of-the-art performance in both the conventional (CZSL) and generalized settings, supporting dual relation mining as a way to transfer knowledge to unseen classes.

Abstract

Zero-shot learning (ZSL) aims to recognize novel classes by transferring shared semantic knowledge (e.g., attributes) from seen classes to unseen classes. Recently, attention-based methods, which align visual features and attributes via a spatial attention mechanism, have shown significant progress. However, these methods explore the visual-semantic relationship only in the spatial dimension, which can lead to classification ambiguity when different attributes share similar attention regions, and the semantic relationship between attributes is rarely discussed. To alleviate these problems, we propose a Dual Relation Mining Network (DRMN) that enables more effective visual-semantic interactions and learns the semantic relationships among attributes for knowledge transfer. Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information through multi-level feature fusion and applies spatial attention for visual-to-semantic embedding. Moreover, an attribute-guided channel attention is used to decouple entangled semantic features. For semantic relationship modeling, we use a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations across images. Additionally, a global classification branch is introduced as a complement to human-defined semantic attributes, and its results are combined with those of attribute-based classification. Extensive experiments demonstrate that the proposed DRMN achieves new state-of-the-art performance on three standard ZSL benchmarks, i.e., CUB, SUN, and AwA2.
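
To make the two attention steps named in the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes attribute embeddings have already been projected to the visual feature dimension, uses scaled dot-product scores for the spatial attention, and uses a hypothetical sigmoid gate (weight matrix `w`) for the attribute-guided channel attention. The function names, shapes, and the gate form are all illustrative assumptions; multi-level feature fusion is omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(regions, attr_queries):
    """Region-attribute spatial attention (illustrative).

    regions:      (R, C) visual features, one row per image region.
    attr_queries: (K, C) attribute embeddings, assumed already in visual space.
    Returns the (K, R) attention map and the (K, C) attended attribute features.
    """
    scores = attr_queries @ regions.T / np.sqrt(regions.shape[1])
    attn = softmax(scores, axis=1)   # each attribute distributes weight over regions
    return attn, attn @ regions

def channel_attention(attr_feats, attr_queries, w):
    """Hypothetical attribute-guided channel gate.

    Each attribute produces per-channel weights in (0, 1) that rescale its own
    attended feature, so attributes sharing a region can still differ by channel.
    """
    gate = 1.0 / (1.0 + np.exp(-(attr_queries @ w)))   # (K, C)
    return gate * attr_feats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(49, 64))    # e.g. a 7x7 feature map, flattened
    queries = rng.normal(size=(5, 64))     # 5 attributes
    w = 0.1 * rng.normal(size=(64, 64))    # hypothetical gate weights
    attn, feats = spatial_attention(regions, queries)
    gated = channel_attention(feats, queries, w)
    print(attn.shape, feats.shape, gated.shape)
```

In this sketch, two attributes that attend to the same regions receive nearly identical spatial features, and only the channel gate can separate them, which is the ambiguity the abstract attributes to purely spatial attention.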

Paper Structure (16 sections, 10 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: Motivation illustration. (a) Different attributes may share similar attention areas, posing challenges for attribute prediction. (b) The appearance of the same attribute can vary, while different attributes may share similar semantic information that can be leveraged to facilitate knowledge transfer.
  • Figure 2: The architecture of the proposed scheme DRMN. DRMN consists of a Dual Attention Block (DAB), a Semantic Interaction Transformer (SIT), and a global classification branch. DAB fuses multi-level visual features and employs spatial and channel attention mechanisms for visual to semantic embedding. SIT models semantic relationship to enhance the generalization of attribute representations for knowledge transfer. The predicted attributes are projected onto the hyperspherical space for classification. The global classification branch complements the human-defined attributes and we combine the results with attribute-based classification.
  • Figure 3: The DAB models the visual-semantic relationship via Multi-level Spatial Attention and Attribute-guided Channel Attention. The disentangled semantic features are more beneficial for classification and knowledge transfer.
  • Figure 4: Visualization of attention maps for the baseline and our DRMN.
  • Figure 5: Visualization of channel-wise attention weights learned by our Attribute-guided Channel Attention. The channel-wise attention weights and attributes are randomly selected.
  • ...and 2 more figures
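
The Figure 2 caption mentions projecting predicted attributes onto a hyperspherical space and combining the global branch with attribute-based classification. One common way to realize such a step is scaled cosine similarity against class attribute vectors followed by a weighted average of the two branches' probabilities. The sketch below assumes exactly that; the `scale` and `weight` values, the function names, and the assumption that both branches score the same set of classes are illustrative choices, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project vectors onto the unit hypersphere.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def cosine_logits(pred_attrs, class_attrs, scale=20.0):
    """Scaled cosine similarity between predicted attributes and class prototypes.

    pred_attrs:  (K,)   predicted attribute scores for one image.
    class_attrs: (N, K) attribute vector of each candidate class.
    scale:       assumed temperature-like factor (hypothetical value).
    """
    return scale * (l2_normalize(class_attrs) @ l2_normalize(pred_attrs))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble(attr_logits, global_logits, weight=0.5):
    """Convex combination of the two branches' class probabilities (assumed form)."""
    return weight * softmax(attr_logits) + (1.0 - weight) * softmax(global_logits)

if __name__ == "__main__":
    class_attrs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [1.0, 1.0, 0.0]])
    pred = np.array([2.0, 0.1, 0.0])
    attr_logits = cosine_logits(pred, class_attrs)
    global_logits = np.array([1.0, 0.5, 0.2])   # made-up global-branch scores
    probs = ensemble(attr_logits, global_logits)
    print(int(np.argmax(probs)))
```

Because both vectors are normalized before the dot product, the attribute-based score depends only on the direction of the predicted attribute vector, not on its magnitude.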