ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

Zihan Ye; Shreyank N Gowda; Kaile Du; Weijian Luo; Ling Shao

TL;DR

Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data.

Abstract

Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen-class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen-class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive, fully noised generators produce features disconnected from real test samples, which also leads to spurious correlations. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test-time Adaptation (DiffTTA) to adapt the generator using pseudo-label reconstruction, and (v) Diffusion-based Test-time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, further mitigating data scarcity. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.

Paper Structure (43 sections, 3 theorems, 58 equations, 12 figures, 6 tables)

Introduction
Related Work
ZeroDiff++
Notations
Spurious Vision-Semantic Correlation
Training Stage of ZeroDiff++
Key 1: Diffusion Augmentation
Key 2: SC-based Representations
Key 3: Mutual-learned Discriminators
Generating Stage of ZeroDiff++
Key 4: Diffusion-based Test-time Adaptation
Key 5: Diffusion-based Test-time Generation
Experiments
Dataset
Implementation Details
...and 28 more sections

Key Result

Lemma 5.1

For any probability measures $P,Q$ with finite first moments,
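
The statement above is cut off in this listing. For reference, the standard form of Kantorovich–Rubinstein duality, the result Lemma 5.1 is named after, is usually written as below. The paper's exact formulation and notation may differ; the symbols $W_1$ (the 1-Wasserstein distance) and $f$ (a 1-Lipschitz test function) are assumptions introduced here.

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \Big( \mathbb{E}_{x \sim P}\,[f(x)] \;-\; \mathbb{E}_{y \sim Q}\,[f(y)] \Big)
```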

Figures (12)

Figure 1: Overall motivation illustration. (a) Traditional GAN-based ZSL methods suffer from spurious visual-semantic correlation for both seen and unseen classes. (b) ZeroDiff [ye2025zerodiff] employs a diffusion mechanism in the training stage to obtain a substantial correlation on seen classes. (c) Our ZeroDiff++ further exploits the diffusion forward chain to utilize real test samples, encouraging substantial visual-semantic correlation on unseen classes.
Figure 2: Detailed motivation illustration. (a) In the training stage, three problems cause standard GAN-based ZSL approaches to learn a spurious seen correlation: over-fitting to limited data, mismatched static pre-defined semantics, and single-view discriminating. (b) In the generating stage, two problems corrupt the unseen correlation: an unadaptive generator and fully-noised generation. Together, spurious seen and unseen correlations cause feature generation to fail gradually. (c) In contrast, our ZeroDiff++ overcomes these shortcomings using three motivations in the training stage for seen correlation: diffusion-augmented infinite features, dynamic SC-based representations, and multi-view discriminating. (d) Two motivations in the generating stage target unseen correlation: diffusion-based test-time adaptation and generation. As a result, substantial correlations on both seen and unseen classes allow ZeroDiff++ to maintain robust performance with even 10% of the training set.
Figure 3: The epoch-$\Delta^s_{adv}$ and epoch-$\Delta^u_{adv}$ curves of the classical f-VAEGAN [xian2018feature]. A larger $\Delta^s_{adv}$ indicates that $D_{adv}$ judges real seen test examples to be more fake, i.e., it has learned a more spurious visual-semantic correlation on seen classes. Similarly, the epoch-$\Delta^u_{adv}$ curve reveals the unseen correlation learned by $D_{adv}$.
Figure 4: The training stage of our ZeroDiff++. ⓒ represents the concatenation operation. Given frozen extractors $F^{*}_{ce}$ and $F^{*}_{sc}$, we use them to extract clean visual features $\mathbf{v}_{0}$ and contrastive representations $\mathbf{r}_{0}$. Then, we use the diffusion forward chain (Eq. \ref{eq:diff_forward_chain}) to obtain real noised visual features $\mathbf{v}_{t-1}$ and $\mathbf{v}_{t}$. Next, $G$ in DFG denoises/generates a fake clean feature $\tilde{\mathbf{v}}_{0}$, conditioned on the concatenation of the semantic label $\mathbf{a}$, latent variable $\mathbf{z}$, diffusion time $t$, noised feature $\mathbf{v}_{t}$, and SC-based representation $\mathbf{r}_{0}$. The fake clean feature is evaluated from three different learning perspectives: adversarial learning (Does it match predefined semantics?), denoising learning (Does it match diffusion processes?), and representation learning (Does it match contrastive representations?). Finally, we present the mutual learning loss $\mathcal{L}_{mu}$ to integrate the knowledge of all discriminators.
Figure 5: The generating stage of our ZeroDiff++. In the generating stage, ZeroDiff++ first trains a pre-classifier $F_{zsl}$ using the unadaptive DFG. Next, it adapts to unseen classes with the $\mathcal{L}_{utta}$ loss, which minimizes the denoising error between real test features $\mathbf{v}^{u}_{0}$ and fake test features $\tilde{\mathbf{v}}^{u}_{0}$ conditioned on the pseudo semantics from the pre-classifier. It also uses the $\mathcal{L}_{stta}$ loss to prevent forgetting seen classes, yielding the adaptive DFG $G^{\dagger}$. Finally, it samples the partially-noised test features $\mathbf{v}^{u}_{t}$ from the diffusion forward chain, generates the partially-denoised test features $\tilde{\mathbf{v}}^{u}_{t}$, and trains the final ZSL classifier $F^{\dagger}_{zsl}$.
...and 7 more figures
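
The diffusion forward chain mentioned in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. It assumes the standard DDPM closed form $\mathbf{v}_t = \sqrt{\bar\alpha_t}\,\mathbf{v}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$ with a linear beta schedule; the schedule, the step count, and the names `make_alpha_bar` and `forward_noise` are assumptions for illustration and are not taken from the paper.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def forward_noise(v0, t, alpha_bar, rng):
    """Sample v_t ~ N(sqrt(alpha_bar[t]) * v0, (1 - alpha_bar[t]) * I)."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in v0]

rng = random.Random(0)
alpha_bar = make_alpha_bar()

# Stand-in for a clean visual feature v_0 (real features would come from a frozen extractor).
v0 = [rng.gauss(0.0, 1.0) for _ in range(8)]

v_partial = forward_noise(v0, 100, alpha_bar, rng)  # small t: mostly signal, partially noised
v_full = forward_noise(v0, 999, alpha_bar, rng)     # large t: almost pure noise
```

At a small step `t` most of the real feature survives, which is the sense in which a partially noised test feature stays connected to real data; at the final step the sample is close to pure Gaussian noise, as in fully-noised generation.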

Theorems & Definitions (7)

Lemma 5.1: Kantorovich--Rubinstein duality
proof : Comment
Theorem 5.2: W-distance Contraction
proof
Theorem 5.3: Train-validation Generalization Error Contraction
proof
Remark 5.4
