Mutually Causal Semantic Distillation Network for Zero-Shot Learning

Shiming Chen, Shuhuang Chen, Guo-Sen Xie, Xinge You

Abstract

Zero-shot learning (ZSL) aims to recognize unseen classes in the open world, guided by side information (e.g., attributes). Its key task is to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus to conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention in a weakly-supervised manner and learn spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute$\rightarrow$visual causal attention sub-net that learns attribute-based visual features, and a visual$\rightarrow$attribute causal attention sub-net that learns visual-based attribute features. The causal attentions encourage the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of a semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on four widely-used benchmark datasets (i.e., CUB, SUN, AWA2, and FLO) show that MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performance.
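
To make the two attention directions named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain dot-product attention over toy 2-dimensional vectors, and it omits the causal component entirely. The function name `cross_attend` and all data values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys):
    """For each query, return the softmax-weighted sum of the keys.

    Assumption: similarity is a plain dot product; the paper may use a
    learned (e.g., bilinear) similarity instead.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        attended = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]
        outputs.append(attended)
    return outputs


# Hypothetical toy data: 2 attribute vectors and 3 image-region features.
attributes = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# attribute -> visual: one attended visual feature per attribute.
attr_based_visual = cross_attend(attributes, regions)
# visual -> attribute: one attended attribute feature per region.
visual_based_attr = cross_attend(regions, attributes)
```

The same routine serves both directions because only the roles of query and key are swapped, which is one simple way to read the "mutual" attention design described above.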

Paper Structure (21 sections, 19 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Four investigated ZSL paradigms. (a) Embedding-based method. (b) Generative method. (c) Common space learning method. (d) Our proposed mutually causal semantic distillation network (MSDN++). The semantic space $\mathcal{S}$ is represented by the class semantic vector annotated by humans based on the attribute descriptions. The visual space $\mathcal{V}$ is learned by a network backbone (e.g., ResNet101 He2016DeepRL). The common space $\mathcal{O}$ is a shared latent space between visual mapping and semantic mapping. The attribute space $\mathcal{A}$ is learned by a language model (e.g., GloVe Pennington2014GloveGV). Filled triangles, circles, squares and diamonds denote the sample features in $\mathcal{S}$, $\mathcal{V}$, $\mathcal{O}$ and $\mathcal{A}$, respectively.
  • Figure 2: The pipeline of MSDN++. MSDN++ consists of an attribute$\rightarrow$visual causal attention sub-net (AVCA) and a visual$\rightarrow$attribute causal attention sub-net (VACA). AVCA learns the attribute-based visual features $F$ with attribute-based visual learning and causal visual learning, while VACA discovers the vision-based attribute features $S$ with vision-based attribute learning and causal attribute learning. Then, two mapping functions $\mathcal{M}_1$ and $\mathcal{M}_2$ map the visual features and attribute features into the semantic space as semantic representations $\psi(x)$ and $\Psi(x)$, respectively. A semantic distillation loss $\mathcal{L}_{distill}$ matches the probability estimates of the two sub-nets (i.e., $p_1$ and $p_2$) for semantic distillation, enabling MSDN++ to learn intrinsic semantic knowledge. During inference, we fuse the predictions of the two sub-nets to make full use of the complementary semantic representations.
  • Figure 3: Sample images from four challenging datasets, including three fine-grained datasets (i.e., (a) CUB Welinder2010CaltechUCSDB2, (b) SUN Patterson2012SUNAD, and FLO Nilsback2008AutomatedFC), and one coarse-grained dataset (i.e., AWA2 Xian2017ZeroShotLC). Each image is sampled from a different class. Images of different classes in the fine-grained datasets are very similar and hard to distinguish, while images in the coarse-grained dataset are easier to recognize. For example, CUB includes similar bird images, but AWA2 consists of various animal images.
  • Figure 4: Visualization of attention maps learned by the first sub-nets of MSDN Chen2022MSDNMS and MSDN++ on CUB. We show the top-10 attention maps that the models focus on. The red boxes mark cases where MSDN learns wrong attention maps that are irrelevant to the corresponding attributes.
  • Figure 5: Visualization of attention maps for the two mutual attention sub-nets (i.e., MSDN++(AVCA) and MSDN++(VACA)). Results show that our AVCA and VACA sub-nets learn accurate visual localizations overall, but they also produce a few failure cases.
  • ...and 4 more figures
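
The Figure 2 caption says that a distillation loss matches the probability estimates $p_1$ and $p_2$ of the two sub-nets, and that their predictions are fused at inference. The sketch below illustrates one way this could look; it is an assumption-laden illustration, not the paper's formulation. It uses a Jensen-Shannon divergence as the matching term and a convex combination for fusion. The function names, the weight `alpha`, and the logit values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence (assumed matching term)."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def fuse_predictions(p1, p2, alpha=0.5):
    """Convex combination of the two sub-net predictions.

    alpha is a hypothetical fusion weight, not a value from the paper.
    """
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]


# Hypothetical class logits from the two sub-nets for one image.
p1 = softmax([2.0, 0.5, -1.0])   # stands in for the AVCA prediction
p2 = softmax([1.5, 1.0, -0.5])   # stands in for the VACA prediction

distill_loss = js_divergence(p1, p2)          # training-time matching term
fused = fuse_predictions(p1, p2, alpha=0.5)   # inference-time fusion
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

A symmetric divergence is used here so that neither sub-net is treated as the fixed teacher, which fits the caption's description of the two sub-nets teaching each other; the actual loss may include further terms.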