Multi-Granularity Mutual Refinement Network for Zero-Shot Learning

Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang

TL;DR

This work tackles zero-shot learning by addressing the limitations of global and single-scale regional features in capturing fine-grained, transferable visual cues. It introduces Mg-MRN, a three-component framework consisting of the Multi-granularity Module (MgM) for decoupled region features, the Mutual Refinement Module (MRM) for cross-granularity fusion via Spatial-Channel Attention, and a Transformer-based Visual-Semantic Decoder (VSD) for visual-to-semantic alignment. Training combines semantic cross-entropy and attribute regression losses across all granularity levels, and inference aggregates cosine-based semantic predictions from each level. Experiments on CUB, SUN, and AWA2 demonstrate state-of-the-art or competitive performance in CZSL and GZSL, with ablations confirming the contributions of MgM and MRM to better region disentanglement and attribute localization, and qualitative results corroborating improved attention and clustering of seen/unseen classes.
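The training objective and inference rule summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each granularity level outputs one predicted attribute vector per image, that "aggregates" means summing per-level cosine scores, and that the regression term is a mean-squared error. The function names and the `scale` and `lam` parameters are hypothetical.

```python
import numpy as np

def cosine_scores(pred_attrs, class_attrs, eps=1e-12):
    """Cosine similarity between predicted attributes (B, A) and class prototypes (C, A)."""
    p = pred_attrs / (np.linalg.norm(pred_attrs, axis=1, keepdims=True) + eps)
    c = class_attrs / (np.linalg.norm(class_attrs, axis=1, keepdims=True) + eps)
    return p @ c.T  # shape (B, C)

def aggregate_predict(level_preds, class_attrs):
    """Sum cosine scores over granularity levels (assumed aggregation) and take the arg-max class."""
    total = sum(cosine_scores(p, class_attrs) for p in level_preds)
    return np.argmax(total, axis=1)

def training_loss(level_preds, class_attrs, labels, scale=20.0, lam=1.0):
    """Per-level cross-entropy on scaled cosine logits plus MSE attribute regression.

    `scale` and `lam` are hypothetical hyperparameters, not values from the paper.
    """
    total = 0.0
    rows = np.arange(len(labels))
    for p in level_preds:
        logits = scale * cosine_scores(p, class_attrs)
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        ce = -log_probs[rows, labels].mean()
        mse = ((p - class_attrs[labels]) ** 2).mean()
        total += ce + lam * mse
    return float(total)

# Toy usage: 3 classes with one-hot attribute prototypes, 2 images, 2 granularity levels.
class_attrs = np.eye(3)
level_preds = [
    np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]),
    np.array([[0.7, 0.3, 0.0], [0.1, 0.1, 0.9]]),
]
labels = np.array([0, 2])
predictions = aggregate_predict(level_preds, class_attrs)
loss = training_loss(level_preds, class_attrs, labels)
```

Summing scores gives every level an equal vote. A weighted sum would be an equally plausible reading of the summary, which does not specify the aggregation.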

Abstract

Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at https://github.com/NingWang2049/Mg-MRN.

Paper Structure

This paper contains 28 sections, 14 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The framework of our proposed Mg-MRN. It contains two innovations: (a) Multi-granularity Module, (b) Mutual Refinement Module. The multi-granularity module mines region features of different granularities from the intermediate features of the backbone network through the Region Feature Mining Block (RFMB). The mutual refinement module exploits the Spatial-Channel Attention Block (SCAB) to enhance the discriminability of the representation at each granularity level by integrating the region features of adjacent hierarchies. The bottom left and bottom right show the detailed RFMB and SCAB, respectively. Visual features are mapped to the semantic space through the Visual-Semantic Decoder (VSD) module, which is detailed in Fig. 2.
  • Figure 2: The architecture of Visual-Semantic Decoder Module.
  • Figure 3: Visualization of the Area Under the Seen-Unseen Accuracy Curve (AUSUC). After the model is equipped with the MRM, its AUSUC is significantly improved. This shows that our multi-granularity mutual refinement strategy improves the knowledge transfer from seen classes to unseen classes.
  • Figure 4: The effect of the number of granularity levels $L$ and the number of parts $N_p$ at each granularity level on (a) CUB, (b) AwA2 and (c) SUN.
  • Figure 5: Mean and standard deviation of error distributions on the seen and unseen test sets. This shows that our multi-granularity mutual refinement strategy produces precise semantic predictions.
  • ...and 3 more figures