Dual Expert Distillation Network for Generalized Zero-Shot Learning
Zhijie Rao, Jingcai Guo, Xiaocheng Lu, Jingming Liang, Jie Zhang, Haozhao Wang, Kang Wei, Xiaofeng Cao
TL;DR
Generalized Zero-Shot Learning (GZSL) is challenged by attribute asymmetry and underutilized channel information. The authors propose the Dual Expert Distillation Network (DEDN), which pairs a coarse global expert (cExp) with a cluster-aware fine expert (fExp), builds on a Dual Attention Network (DAN) backbone, and is trained with a Margin-Aware Loss (MAL). Mutual distillation between the two experts, together with region-channel attention and cluster-based specialization, yields state-of-the-art results on CUB, SUN, and AWA2 in both the ZSL and GZSL settings. The work shows that explicitly modeling attribute heterogeneity and exploiting both region and channel cues improves fine-grained visual-attribute correlation, with practical benefits for cross-domain recognition.
Abstract
Zero-shot learning has made consistent and remarkable progress by modeling nuanced one-to-one visual-attribute correlations. Existing studies resort to refining a uniform mapping function to align and correlate sample regions and sub-attributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), in which two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, the coarse expert, cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, while the fine expert, fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts cooperatively distill from each other to reach a mutual agreement during training. We further equip DEDN with a newly designed backbone network, the Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit visual semantic knowledge. Experiments on various benchmark datasets indicate new state-of-the-art performance.
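To make the idea of two experts distilling from each other concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the abstract does not state the distillation objective, so the symmetric KL divergence, the temperature value, and the function names (`softmax`, `kl_divergence`, `mutual_distillation_losses`) are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn class scores into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def mutual_distillation_losses(coarse_logits, fine_logits, temperature=2.0):
    """Hypothetical mutual-distillation terms for one sample.

    Assumption: each expert is pulled toward the other's softened
    prediction. In a real training loop the "teacher" side of each term
    would be treated as a constant (no gradient flows through it).
    """
    p_coarse = softmax(coarse_logits, temperature)
    p_fine = softmax(fine_logits, temperature)
    loss_coarse = kl_divergence(p_fine, p_coarse)  # coarse expert learns from fine
    loss_fine = kl_divergence(p_coarse, p_fine)    # fine expert learns from coarse
    return loss_coarse, loss_fine

# Illustrative class scores from the two experts (made-up numbers).
coarse_scores = [2.0, 0.5, -1.0]
fine_scores = [1.0, 1.5, -0.5]
loss_c, loss_f = mutual_distillation_losses(coarse_scores, fine_scores)
```

Both terms shrink to zero as the two experts' predictions coincide, which is one way to express the "mutual agreement" the abstract describes; the actual loss used by DEDN may differ.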
