Dual Expert Distillation Network for Generalized Zero-Shot Learning

Zhijie Rao, Jingcai Guo, Xiaocheng Lu, Jingming Liang, Jie Zhang, Haozhao Wang, Kang Wei, Xiaofeng Cao

TL;DR

Generalized Zero-Shot Learning is challenged by attribute asymmetry and underutilized channel information. The authors propose Dual Expert Distillation Network (DEDN), pairing a coarse global expert (cExp) with a cluster-aware fine expert (fExp), guided by a Dual Attention Network (DAN) backbone and Margin-Aware Loss (MAL). Mutual distillation between the two experts, along with region-channel attention and cluster-based specialization, yields state-of-the-art results on CUB, SUN, and AWA2 in both ZSL and GZSL. The work demonstrates that explicitly modeling attribute heterogeneity and leveraging both region and channel cues significantly improves fine-grained visual-attribute correlations with practical cross-domain recognition benefits.

Abstract

Zero-shot learning has made consistent and remarkable progress by modeling nuanced one-to-one visual-attribute correlations. Existing studies refine a single uniform mapping function to align and correlate sample regions with subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) unutilized channel information. This paper addresses these issues with a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), in which two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, the coarse expert, cExp, has a complete perceptual scope and coordinates visual-attribute similarity metrics across dimensions, while the fine expert, fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts cooperatively distill from each other to reach a mutual agreement during training. We further equip DEDN with a newly designed backbone network, the Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit visual semantic knowledge. Experiments on various benchmark datasets indicate a new state of the art.
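The division of labor between the two experts can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear experts, the random weights, the hand-picked attribute clusters, and the symmetric KL divergence used as the distillation term are all assumptions made for illustration, and the DAN backbone and Margin-Aware Loss are omitted.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

rng = np.random.default_rng(0)
n, d, n_attr, n_cls = 4, 8, 6, 3          # hypothetical sizes
x = rng.normal(size=(n, d))               # stand-in for backbone features
class_attr = rng.normal(size=(n_cls, n_attr))  # stand-in class-attribute matrix

# Coarse expert (cExp): one mapping that predicts every attribute.
W_c = rng.normal(size=(d, n_attr))
attr_c = x @ W_c

# Fine expert (fExp): one sub-mapping per exclusive attribute cluster.
# The clusters here are hand-picked placeholders.
clusters = [[0, 1], [2, 3, 4], [5]]
W_f = [rng.normal(size=(d, len(c))) for c in clusters]
attr_f = np.empty((n, n_attr))
for idx, W in zip(clusters, W_f):
    # Writing each sub-output into its own columns is equivalent to
    # concatenating the sub-network outputs in attribute order.
    attr_f[:, idx] = x @ W

# Class scores via compatibility with the class-attribute matrix.
p_c = softmax(attr_c @ class_attr.T)
p_f = softmax(attr_f @ class_attr.T)

# Assumed distillation term: symmetric KL, so each expert is pulled
# toward the other's class distribution.
distill = kl(p_c, p_f) + kl(p_f, p_c)
print(attr_f.shape, round(distill, 4))
```

In a trainable version, this term would be added to each expert's classification loss so that the two sets of predictions converge during training.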

Paper Structure

This paper contains 20 sections, 19 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: (a) cExp, also the common practice in existing works, possesses complete attribute-awareness capability yet lacks the ability to process fine-grained semantic information. (b) fExp, which consists of multiple specialized sub-networks, lacks a global perception field.
  • Figure 2: Left: cExp covers the holistic attribute set, while fExp consists of multiple sub-networks, each responsible for predicting only a subset of the attributes. The outputs of all sub-networks are concatenated to form the final result of fExp. A distillation loss is then applied to facilitate joint learning. Right: The architecture of DAN.
  • Figure 3: Visualization of the attention heat maps. The first row represents the heat maps of cExp, and the second row denotes the heat maps of fExp.
  • Figure 4: (a) Sensitivity to $\lambda_e$. (b) Sensitivity to $\lambda_{rc}$. The harmonic mean (H) is reported. (c) Comparison with K-means. (d) Impact of the number of attribute clusters. The harmonic mean (H) and top-1 accuracy (T) are reported.
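
For reference, the harmonic mean (H) reported in GZSL evaluations conventionally combines the accuracy on seen classes with the accuracy on unseen classes. A minimal sketch of that standard definition follows; the function name and the sample accuracies are illustrative and are not taken from the paper's tables.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen- and unseen-class accuracy (same units in and out)."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define H as zero to avoid division by zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total

# Illustrative values only.
print(harmonic_mean(80.0, 40.0))
```

Because H is dominated by the smaller of the two accuracies, a model that does well only on seen classes scores poorly, which is why it is the usual headline metric in the generalized setting.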