Multi-Granularity Mutual Refinement Network for Zero-Shot Learning
Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang
TL;DR
This work tackles zero-shot learning by addressing the limitations of global and single-scale regional features in capturing fine-grained, transferable visual cues. It introduces Mg-MRN, a three-component framework: the Multi-granularity Module (MgM), which extracts decoupled region features; the Mutual Refinement Module (MRM), which fuses features across granularities via Spatial-Channel Attention; and a Transformer-based Visual-Semantic Decoder (VSD), which aligns visual features with semantic attributes. Training combines semantic cross-entropy and attribute regression losses across all granularity levels, and inference aggregates cosine-based semantic predictions from each level. Experiments on CUB, SUN, and AWA2 show state-of-the-art or competitive performance in both the conventional (CZSL) and generalized (GZSL) zero-shot learning settings. Ablations confirm that MgM and MRM contribute to better region disentanglement and attribute localization, and qualitative results show improved attention maps and clearer clustering of seen and unseen classes.
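The inference step mentioned above (aggregating cosine-based semantic predictions from each granularity level) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `predict`, the equal-weight sum over levels, and the toy attribute vectors are all assumptions made for illustration, and the actual method may weight levels differently or apply calibration in the GZSL setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def predict(level_embeddings, class_attributes):
    """Sum per-level cosine scores and return the best-scoring class.

    level_embeddings: one predicted attribute vector per granularity level.
    class_attributes: dict mapping class name -> class attribute vector.
    Equal weighting of levels is an assumption of this sketch.
    """
    scores = {
        name: sum(cosine(emb, attrs) for emb in level_embeddings)
        for name, attrs in class_attributes.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (made up): 3 attributes, 2 classes, 2 granularity levels.
class_attributes = {"sparrow": [1.0, 0.0, 1.0], "gull": [0.0, 1.0, 1.0]}
level_embeddings = [[0.9, 0.1, 0.8], [0.7, 0.3, 1.0]]
label, scores = predict(level_embeddings, class_attributes)
print(label)
```

Because unseen classes are described only by their attribute vectors, adding a new class here amounts to adding one entry to `class_attributes`, which is what makes this style of scoring usable in the zero-shot setting.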
Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes, for which no training samples are available, by transferring semantic knowledge from seen classes. Current approaches typically either correlate global visual features with semantic information (i.e., attributes) or align local visual region features with their corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose the Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module that learns region-level discriminative features through decoupled region feature mining. A cross-granularity feature fusion module then strengthens the inherent interactions between region features of different granularities. This module enhances the discriminability of the representation at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets show that Mg-MRN performs better than or comparably to existing methods. Our code is available at https://github.com/NingWang2049/Mg-MRN.
