Contextual Interaction via Primitive-based Adversarial Training For Compositional Zero-shot Learning

Suyi Li; Chenyi Jiang; Shidong Wang; Yang Long; Zheng Zhang; Haofeng Zhang

TL;DR

This work proposes a novel over-sampling strategy with object-similarity guidance to augment target compositional training data and performs detailed quantitative analysis and retrieval experiments on well-established datasets to validate the effectiveness of the proposed method.

Abstract

Compositional Zero-shot Learning (CZSL) aims to identify novel compositions of attributes and objects from known attribute-object pairs. The primary challenge in CZSL lies in the significant discrepancies introduced by the complex interaction between the visual primitives of attribute and object, which degrades classification performance on novel compositions. Previous works have primarily addressed this issue through disentangling strategies or by using object-based conditional probabilities to constrain the selection space of attributes. However, few studies have explored the problem from the perspective of modeling the mechanism of visual primitive interactions. Inspired by the success of vanilla adversarial learning in Cross-Domain Few-Shot Learning, we take a step further and devise a model-agnostic Primitive-Based Adversarial training (PBadv) method to deal with this problem. In addition, recent studies highlight that hard compositions are perceived poorly even under data-balanced conditions. To this end, we propose a novel over-sampling strategy with object-similarity guidance to augment target compositional training data. We performed detailed quantitative analysis and retrieval experiments on well-established datasets, namely UT-Zappos50K, MIT-States, and C-GQA, to validate the effectiveness of our proposed method, which achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/lisuyi/PBadv_czsl.
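To make the idea of perturbing primitive features concrete, the following is a minimal sketch of a feature-level adversarial step, not the authors' PBadv implementation. It assumes a toy linear classifier over a primitive feature vector, adds Gaussian noise, and then takes one sign-gradient (FGSM-style) step that increases the cross-entropy loss. The names `perturb_feature`, `epsilon`, and `sigma`, and the choice of a single sign-gradient step, are assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, feature):
    """Logits of a toy linear classifier: one weight row per class."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def cross_entropy(weights, feature, label):
    """Cross-entropy loss of the toy classifier for a single feature."""
    return -math.log(softmax(linear_logits(weights, feature))[label])


def loss_grad_wrt_feature(weights, feature, label):
    """Analytic gradient dL/df = W^T (p - onehot(label))."""
    probs = softmax(linear_logits(weights, feature))
    grad = [0.0] * len(feature)
    for k, row in enumerate(weights):
        coeff = probs[k] - (1.0 if k == label else 0.0)
        for j, w in enumerate(row):
            grad[j] += coeff * w
    return grad


def perturb_feature(weights, feature, label, epsilon=0.1, sigma=0.01, rng=None):
    """Hypothetical helper: Gaussian noise followed by one sign-gradient step."""
    rng = rng or random.Random(0)
    noisy = [x + rng.gauss(0.0, sigma) for x in feature]
    grad = loss_grad_wrt_feature(weights, noisy, label)
    # Move each coordinate by epsilon in the direction that raises the loss.
    return [x + epsilon * ((g > 0) - (g < 0)) for x, g in zip(noisy, grad)]


# Toy values, chosen only for illustration.
W = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]
f_attr = [0.5, 0.2, -0.1]
f_adv = perturb_feature(W, f_attr, label=0, epsilon=0.1, sigma=0.0)
```

In adversarial training, the perturbed feature would typically be fed through the rest of the model alongside the original one so that the classifier learns to be robust to such shifts. How the paper composes and aligns the perturbed attribute and object features is described in its method section and is not reproduced here.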

Paper Structure (17 sections, 19 equations, 6 figures, 5 tables)

Introduction
Related Work
Compositional Zero-shot Learning
Adversarial Attack
Our Approach
Task Formulation and Preliminary
Primitive-Based Adversarial Training
Object Similarity Based Oversampling Method
Training and Inference
Experiments
Experimental Setting
Quantitative Analysis
Effectiveness of PBadv
Ablation study
Hyper-Parameter Analysis
...and 2 more sections

Figures (6)

Figure 1: Motivation illustration. Previous state-of-the-art works (denoted by a yellow dashed border) lacked further exploration into the semantic interaction mechanism in the visual domain. To this end, we devise a perturbation method (denoted by a blue dashed border) named PBadv to model the mechanism of the complex interactions to allow the CZSL model to be robust to visually diverse compositions.
Figure 2: An overview of PBadv. The framework of the classification network can be divided into two parallel parts: the visual primitive adversarial training branch (➀, ➁, ➃ and ➄) and the base branch (➂). Branches ➀ and ➁ form a traditional disentangling architecture, which aims to decompose encoded visual features (i.e., [CLS]) into corresponding attribute and object features (i.e., $\boldsymbol{f}_a$ and $\boldsymbol{f}_o$). Next, a series of perturbed features is synthesized by adding Gaussian noise and attacking the original features (our PBadv method, shown as (a)). After the compositional feature rebuilding process (➃ and ➄), the composed original and perturbed features are projected into a common pair space for semantic alignment. (b): The object-similarity-guidance sampling probability $\mathcal{P}^{o}$ is obtained by computing the similarity between all possible objects via prior knowledge (e.g., GloVe). (c): In the OS-OSP method, the probability $\mathcal{P}^{o}$ is used to guide the attribute selection of images during the oversampling process.
Figure 3: The influence of the temperature coefficient on the best AUC and HM on UT-Zappos50K and C-GQA.
Figure 4: Qualitative results of the image-to-text retrieval experiment on the test set of MIT-States.
Figure 5: Qualitative results of the text-to-image retrieval experiment on the test set of C-GQA.
...and 1 more figures
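
The caption of Figure 2 (b) describes a sampling probability $\mathcal{P}^{o}$ derived from similarities between objects under prior knowledge such as GloVe. The sketch below illustrates one way such a probability could be computed. It is an assumption-laden example, not the paper's formula: the toy 3-dimensional vectors stand in for real word embeddings, and the cosine similarity, the softmax normalization, the `temperature` parameter, and the exclusion of the target object itself are all illustrative choices. The function name `object_sampling_probs` is hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def object_sampling_probs(embeddings, target, temperature=0.1):
    """Hypothetical helper: softmax over similarities to the other objects."""
    others = [name for name in embeddings if name != target]
    sims = [cosine(embeddings[target], embeddings[name]) for name in others]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    total = sum(exps)
    return {name: e / total for name, e in zip(others, exps)}


# Toy embeddings standing in for GloVe vectors (illustrative values only).
TOY_EMBEDDINGS = {
    "boot": [0.9, 0.1, 0.0],
    "shoe": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}

probs = object_sampling_probs(TOY_EMBEDDINGS, "boot")

# Draw one similar object according to the probabilities.
rng = random.Random(0)
sampled = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

Under these toy values, "shoe" receives far more probability mass than "apple" because its vector points in nearly the same direction as "boot". How the sampled object then guides attribute selection during oversampling is specific to the paper's OS-OSP method and is not modeled here.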
