Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning

Wenjin Hou, Shiming Chen, Shuhuang Chen, Ziming Hong, Yan Wang, Xuetao Feng, Salman Khan, Fahad Shahbaz Khan, Xinge You

TL;DR

The paper tackles poor generalization in generative zero-shot learning, which it attributes to conditioning the generator on Gaussian noise and a static, predefined semantic prototype. It introduces the Visual-Augmented Dynamic Semantic prototype method (VADS), which combines Visual-aware Domain Knowledge Learning (VDKL) and Vision-Oriented Semantic Updation (VOSU) to produce a dynamic semantic prototype $[Z', \dot{a}]$ that conditions the generator. By enriching the condition with dataset-specific visual priors and instance-level semantic updates, VADS improves both CZSL and GZSL results on AWA2, SUN, and CUB, and can be plugged into multiple generative ZSL backbones (e.g., CLSWGAN, TF-VAEGAN, FREE). Ablations indicate that both VDKL and VOSU are necessary, and further analysis shows better unseen-class feature synthesis, which points to more robust knowledge transfer in zero-shot learning.

Abstract

Generative zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods condition the generator on Gaussian noise and the predefined semantic prototype, so the generator is optimized only for specific seen classes rather than characterizing each visual instance, resulting in poor generalization (e.g., overfitting to seen classes). To address this issue, we propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) that helps the generator learn an accurate semantic-visual mapping by fully incorporating visual-augmented knowledge into the semantic conditions. In detail, VADS consists of two modules: (1) the Visual-aware Domain Knowledge Learning module (VDKL) learns the local bias and global prior of the visual features (referred to as domain visual knowledge), which replace pure Gaussian noise to provide richer prior noise information; (2) the Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype according to the visual representations of the samples. Ultimately, we concatenate their outputs as a dynamic semantic prototype, which serves as the condition of the generator. Extensive experiments demonstrate that VADS achieves superior CZSL and GZSL performance on three prominent datasets and outperforms other state-of-the-art methods, with average increases of 6.4%, 5.9% and 4.2% on SUN, CUB and AWA2, respectively.
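
In symbols, the contrast drawn in the abstract can be summarized as follows. This is a notational sketch, not the paper's own equations: $G$ (generator), $\epsilon$ (Gaussian noise), $a$ (predefined semantic prototype) and $\tilde{x}$ (synthesized visual feature) are assumed names, while $Z'$ and $\dot{a}$ are taken to denote the VDKL output and the VOSU-updated prototype, respectively.

```latex
\begin{align*}
\text{conventional conditioning:}\quad & \tilde{x} = G(\epsilon, a), \qquad \epsilon \sim \mathcal{N}(0, I) \\
\text{VADS conditioning:}\quad & \tilde{x} = G\big([Z', \dot{a}]\big)
\end{align*}
```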

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: An illustration of the core idea of our method. (a) The semantic prototype (i.e., attribute) of different images of the same category is not fixed, so the predefined semantic prototype is inaccurate in characterizing each instance. (b) Most existing works utilize Gaussian noise and the predefined semantic prototype as conditions to train a semantic$\rightarrow$visual generator on seen classes, which fails to generalize to unseen classes. (c, d) Our method incorporates a rich visual prior with an updated semantic prototype to construct a visual-augmented dynamic semantic prototype of each instance, empowering the generator to synthesize features that faithfully represent the real distribution of unseen classes. Thus, our method achieves better generalization on seen and unseen classes than existing works (e.g., CLSWGAN [xian2018feature]).
  • Figure 2: The architecture of our proposed VADS. It consists of two learnable modules: a Vision-Oriented Semantic Updation module (VOSU) and a Visual-aware Domain Knowledge Learning module (VDKL). First, we obtain the prior distribution $\bm{Z}$ from the Visual Encoder ($\mathit{VE}$). Following this, the Domain Knowledge Learning network ($\mathit{DKL}$) transforms $\bm{Z}$ into a local bias $\bm{b}$, which is subsequently added to the global learnable prior vectors ($\bm{p}$) to construct the domain visual prior noise (i.e., $\bm{Z}'$). At the bottom, VOSU updates the semantic prototype in two stages (depicted by the blue and green arrows). Finally, the visual prior noise and the updated semantic prototype together form a dynamic semantic prototype, which the generator uses to reconstruct features.
  • Figure 3: t-SNE visualizations on CUB. The 10 different colors refer to the 5 seen classes and 5 unseen classes that are randomly selected. Please zoom in for a better view.
  • Figure 4: Effect of (a) the number of synthesized samples $N_{syn}$ and the loss weights (b) $\lambda_{con}$, (c) $\lambda_{kl}$, and (d) $\lambda_{sc}$ on CUB.
  • Figure 5: Visualization of the heatmap of semantic prototype similarity. We randomly select 10 classes on CUB.
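
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Visual Encoder and the DKL network are replaced by hypothetical bias-free linear maps (`ve_w`, `dkl_w`), the updated prototype `a_dot` is supplied directly instead of being produced by VOSU, and all dimensions are made up. Only the two structural steps come from the caption: the local bias $b$ is added to the global prior $p$ to give $Z'$, and $Z'$ is concatenated with the updated prototype.

```python
import random


def linear(weights, vec):
    """Apply a bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a hypothetical stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dynamic_prototype(x, a_dot, ve_w, dkl_w, p):
    """Build the generator condition [Z', a_dot] for one visual feature x."""
    z = linear(ve_w, x)                          # Z: toy Visual Encoder output
    b = linear(dkl_w, z)                         # local bias b = DKL(Z)
    z_prime = [bi + pi for bi, pi in zip(b, p)]  # Z' = b + p (global prior)
    return z_prime + a_dot                       # concatenation [Z', a_dot]


rng = random.Random(0)
feat_dim, latent_dim, attr_dim = 8, 4, 3         # assumed toy sizes

x = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]  # one visual feature
a_dot = [0.2, 0.7, 0.1]                              # assumed updated prototype
ve_w = random_matrix(latent_dim, feat_dim, rng)
dkl_w = random_matrix(latent_dim, latent_dim, rng)
p = [0.0] * latent_dim                               # global prior, zero here

condition = dynamic_prototype(x, a_dot, ve_w, dkl_w, p)
print(len(condition))  # 7 = latent_dim + attr_dim
```

Because $Z'$ depends on the visual feature `x`, two images of the same class yield different conditions, which is the sense in which the prototype is dynamic rather than fixed per class.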