ZeroDiff: Solidified Visual-Semantic Correlation in Zero-Shot Learning

Zihan Ye, Shreyank N. Gowda, Xiaowei Huang, Haotian Xu, Yaochu Jin, Kaizhu Huang, Xiaobo Jin

TL;DR

This work tackles zero-shot learning under limited seen-class data, where spurious visual-semantic correlations degrade feature synthesis. It introduces ZeroDiff, a diffusion-based generative framework that combines diffusion augmentation, dynamic SC-based instance semantics, and mutual learning across three discriminators to strengthen visual-semantic alignment. Empirical results on AWA2, CUB, and SUN demonstrate state-of-the-art ZSL/GZSL performance and robustness under reduced training data, validating data efficiency gains. The approach offers a practical pathway to robust ZSL in data-constrained regimes by fusing diffusion, instance-level semantics, and multi-view discriminators.

Abstract

Zero-shot Learning (ZSL) aims to enable classifiers to identify unseen classes. This is typically achieved by generating visual features for unseen classes based on learned visual-semantic correlations from seen classes. However, most current generative approaches heavily rely on having a sufficient number of samples from seen classes. Our study reveals that a scarcity of seen class samples results in a marked decrease in performance across many generative ZSL techniques. We argue, quantify, and empirically demonstrate that this decline is largely attributable to spurious visual-semantic correlations. To address this issue, we introduce ZeroDiff, an innovative generative framework for ZSL that incorporates diffusion mechanisms and contrastive representations to enhance visual-semantic correlations. ZeroDiff comprises three key components: (1) Diffusion augmentation, which naturally transforms limited data into an expanded set of noised data to mitigate generative model overfitting; (2) Supervised-contrastive (SC)-based representations that dynamically characterize each limited sample to support visual feature generation; and (3) Multiple feature discriminators employing a Wasserstein-distance-based mutual learning approach, evaluating generated features from various perspectives, including pre-defined semantics, SC-based representations, and the diffusion process. Extensive experiments on three popular ZSL benchmarks demonstrate that ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Our code is available at https://github.com/FouriYe/ZeroDiff_ICLR25.
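
As a rough illustration of the diffusion augmentation idea in component (1), the sketch below noises a few clean visual features at randomly drawn diffusion steps, using the standard closed-form forward process $\mathbf{v}_{t} = \sqrt{\bar{\alpha}_{t}}\,\mathbf{v}_{0} + \sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon}$. It is not the authors' implementation: the linear schedule, the number of steps, and the helper names (`linear_beta_schedule`, `noise_feature`, `augment`) are all assumptions made for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule (an assumption, not taken from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_feature(v0, t, alpha_bars, rng):
    """Sample v_t from N(sqrt(alpha_bar_t) * v0, (1 - alpha_bar_t) * I); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    scale, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in v0]

def augment(features, alpha_bars, copies, rng):
    """Turn each clean feature into `copies` noised versions at random steps."""
    out = []
    for v0 in features:
        for _ in range(copies):
            t = rng.randint(1, len(alpha_bars))  # inclusive on both ends
            out.append((t, noise_feature(v0, t, alpha_bars, rng)))
    return out

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(4))  # 4 steps is illustrative
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]          # toy stand-ins for v_0
augmented = augment(features, alpha_bars, copies=3, rng=rng)
```

Because a fresh step and fresh noise are drawn on every call, the same small set of clean features yields an effectively unbounded stream of distinct noised samples, which is the sense in which the limited data is expanded.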

Paper Structure (30 sections, 26 equations, 11 figures, 4 tables, 3 algorithms)

Figures (11)

  • Figure 1: Core idea of ZeroDiff. (a) Standard GAN-based ZSL approaches suffer from (1) over-fitting to limited data, (2) mismatched static pre-defined semantics, and (3) single-view discrimination. As a result, fewer training samples lead to more spurious visual-semantic correlations, and feature generation gradually fails. (b) In contrast, ZeroDiff overcomes these shortcomings using (1) diffusion-augmented infinite features, (2) dynamic SC-based representations, and (3) multi-view discrimination. As a result, ZeroDiff learns substantial visual-semantic correlations and remains robust even with only 10% of the training set.
  • Figure 2: The $\Delta_{adv}$-epoch curve for the classical f-VAEGAN [xian2018feature]. A larger $\Delta_{adv}$ indicates that $D_{adv}$ judges real seen-class test examples to be more fake, i.e., it has learned more spurious visual-semantic correlations.
  • Figure 3: Training pipeline of our DFG. ⓒ represents the concatenation operation. Given frozen extractors $F^{*}_{ce}$ and $F^{*}_{sc}$, we use them to extract clean visual features $\mathbf{v}_{0}$ and contrastive representations $\mathbf{r}_{0}$. Then, we use the diffusion forward chain (Eq. \ref{eq:diff_forward_chain}) to obtain real noised visual features $\mathbf{v}_{t-1}$ and $\mathbf{v}_{t}$. Next, $G$ in DFG denoises/generates a fake clean feature $\tilde{\mathbf{v}}_{0}$, conditioned on the concatenation of the semantic label $\mathbf{a}$, latent variable $\mathbf{z}$, diffusion time $t$, noised feature $\mathbf{v}_{t}$, and SC-based representation $\mathbf{r}_{0}$. The fake clean feature is evaluated from three different learning perspectives: adversarial learning (Does it match predefined semantics?), denoising learning (Does it match diffusion processes?), and representation learning (Does it match contrastive representations?). Finally, we present the mutual learning loss $\mathcal{L}_{mu}$ to integrate the knowledge of all discriminators.
  • Figure 4: The effect of $\mathcal{L}_{mu}$ on $\Delta_{adv}$ (Eq. \ref{eq:delta_adv}) and $\Delta_{diff}$ (Eq. \ref{eq:delta_diff}) on AWA2. (a) indicates that our $\mathcal{L}_{mu}$ mitigates the overfitting of $D_{adv}$ to the training set. (b) shows that the distinguishing ability of $D_{diff}$ is enhanced by our $\mathcal{L}_{mu}$.
  • Figure 5: Heatmap comparison of the semantic prototype similarity among (a) pre-defined semantics [Reed_2016_CVPR], (b) dynamic VADS [hou2024visual], and (c) our proposed SC-based representation. We randomly select 8 classes on CUB. Our method makes the semantic prototypes better at distinguishing between categories, e.g., the similarities marked by the red dashed line.
  • ...and 6 more figures