Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning

Haojie Pu, Zhuoming Li, Yongbiao Gao, Yuheng Jia

TL;DR

This work proposes an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment, which significantly outperforms state-of-the-art methods.

Abstract

Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.
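
As a rough illustration of the sampling step in ADM, the sketch below draws instance-level attributes from a per-class Gaussian of the form $\mathcal{N}(a+\mu_a,\sigma_a^2)$, the form given in the Figure 2 caption. It is only a minimal NumPy sketch, not the authors' implementation: in the paper the offset and scale come from a learned encoder, whereas here `mu_a` and `sigma_a` are hypothetical fixed arrays, and the function name and all numeric values are assumptions made for the example.

```python
import numpy as np


def sample_instance_attributes(class_attr, mu_offset, sigma, n_samples, rng):
    """Draw n_samples vectors from N(class_attr + mu_offset, sigma^2).

    Uses the reparameterization mean + sigma * eps with eps ~ N(0, I),
    treating each attribute dimension as independent (an assumption).
    """
    class_attr = np.asarray(class_attr, dtype=float)
    mean = class_attr + np.asarray(mu_offset, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    eps = rng.standard_normal((n_samples, class_attr.shape[0]))
    return mean + sigma * eps


# Hypothetical values for one class with three attributes.
rng = np.random.default_rng(0)
a = np.array([0.9, 0.1, 0.5])          # class-level attribute vector
mu_a = np.array([0.0, 0.05, -0.1])     # stand-in for a learned mean offset
sigma_a = np.array([0.05, 0.02, 0.1])  # stand-in for a learned std. deviation
samples = sample_instance_attributes(a, mu_a, sigma_a, 2000, rng)
```

Each row of `samples` plays the role of one instance-level attribute vector; for an unseen class, the same call would be made with that class's attribute vector and the corresponding predicted offset and scale.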

Paper Structure (20 sections, 11 equations, 13 figures, 6 tables)

This paper contains 20 sections, 11 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Motivation and remedy. (a) Class-instance Gap. The class-level attribute fails to precisely describe different instances due to occlusion or intra-class ambiguity, leading to a class-instance gap. Moreover, existing works achieve semantic instantiation only on seen classes and fail on unseen classes. (b) Our Remedy for Class-instance Gap. Based on the observation that attribute distributions exhibit similar structural patterns across seen and unseen classes, the class-level attribute is used to encode an attribute distribution on seen classes, under the supervision of visually grounded attributes obtained via activation maps. The learned distributions can be transferred to unseen classes, from which instance-level attributes are obtained by sampling. (c) Semantic-visual Domain Gap. Classes with highly similar attributes (e.g., Class 1 and Class 2) can differ drastically in visual appearance, revealing a large semantic-visual domain gap. Such a gap is also reflected in the inconsistency of inter-class relationships between the semantic and visual spaces, making cross-domain generation difficult. (d) Our Remedy for Semantic-visual Domain Gap. We address this by aligning the attributes from the semantic domain to the visual domain before generation, thus preserving inter-class correlations consistent with the visual domain.
  • Figure 2: Overview of the proposed ADiVA framework. Our ADiVA consists of two complementary modules: an Attribute Distribution Modeling (ADM) module and a Visual-Guided Alignment (VGA) module. Given an input image, the pretrained ViT extracts global features $x$ and local patches $F$, while GloVe provides semantic embeddings $S$. Within ADM, the Attribute Location Network (ALN) employs a semantic-guided attention mechanism to obtain visually grounded attributes $\bar{a}$, then the Attribute Distribution Encoder (ADE) models the attribute distribution $\boldsymbol{p}(a)\sim \mathcal{N}(a+\mu_a,\sigma_a^2)$, under the supervision of $\bar{a}$ to ensure visual alignment. VGA then leverages the sampled attributes $\hat{a}$ to align the semantic and visual spaces via the proposed Alignment Loss ($\mathcal{L}_{align}$), obtaining a visual prior $\tilde{x}$ that preserves inter-class visual relationships. Finally, $\hat{a}$ and $\tilde{x}$ are concatenated as the generator’s conditions to synthesize visual features.
  • Figure 3: Illustration of the Attribute Location Network (ALN). It learns to predict visually grounded attributes for a given image, which better reflect the image's actual attribute state.
  • Figure 4: Distributions of instance-level attributes on seen and unseen classes.
  • Figure 5: Qualitative and quantitative evaluation with t-SNE visualization and FID score. Visual features from f-VAEGAN [xian2019f] and from our ADiVA are shown. We use 10 colors to denote 10 unseen classes randomly selected from CUB. FID measures the discrepancy between the distribution of generated features and that of real features.
  • ...and 8 more figures
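
The Figure 5 caption describes FID as measuring the discrepancy between the distributions of generated and real features. The sketch below shows one common way such a score is computed: the Fréchet distance between Gaussians fitted to the two feature sets, $\lVert\mu_r-\mu_g\rVert^2 + \mathrm{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)$. The text above does not say how the authors compute their score, so treat this as an assumption-laden illustration; the function name and the random stand-in features are hypothetical.

```python
import numpy as np


def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (n_samples, dim) with dim >= 2.
    """
    mu_r = real_feats.mean(axis=0)
    mu_g = gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # PSD inputs; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Random stand-ins for real and generated features (not data from the paper).
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))
same_score = frechet_distance(real, real)           # identical sets
shifted_score = frechet_distance(real, real + 3.0)  # mean shifted by 3 per dim
```

Identical feature sets give a score of approximately zero, and shifting every dimension by a constant leaves the covariance term unchanged while adding the squared mean difference, so lower values indicate generated features that are statistically closer to the real ones.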