Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training

Yun Li, Zhe Liu, Lina Yao

TL;DR

This work tackles Compositional Zero-Shot Learning (CZSL) by addressing two core challenges: understanding primitives and linking attributes to objects correctly. It introduces the ULAO framework, which comprises two modules. The Understanding Attributes and Objects (UAO) module uses sequential, object-focused prediction and contextual object hints to refine attribute recognition. The Linking Attributes and Objects (LAO) module learns robust linkages through a novel contrastive training scheme with textual hard negatives and adaptive thresholds. The approach yields state-of-the-art results on MIT-States, UT-Zappos, and C-GQA in both Closed-World and Open-World settings, with ablations confirming the contributions of object-contextual cues and adaptive contrastive learning. Overall, ULAO demonstrates that separating primitive grounding from linkage learning and guiding negatives through UAO predictions significantly enhances compositional recognition in CZSL tasks.

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of seen attributes and objects. Current CLIP-based methods in CZSL, despite their advancements, often fail to effectively understand and link attributes and objects due to inherent limitations in CLIP's pretraining mechanisms. To address these shortcomings, this paper introduces a novel framework, Understanding and Linking Attributes and Objects (ULAO) in CZSL, which comprises two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding by predicting primitives sequentially and leveraging recognized objects as contextual hints for attribute classification. Concurrently, the Linking Attributes and Objects (LAO) module improves the understanding of attribute-object linkage through a new contrastive learning strategy that incorporates tailored hard negative generation and adaptive loss adjustments. We demonstrate our model's superiority by showcasing its state-of-the-art performance across three benchmark datasets in both Closed-World (CW) and Open-World (OW) scenarios.

Paper Structure

This paper contains 11 sections, 12 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Failure cases of using CLIP-based models to solve CZSL and how our model fixes them by context-aware sequential learning and contrastive learning.
  • Figure 2: Model Overview. The proposed ULAO is equipped with UAO and LAO to understand and link attributes and objects.
  • Figure 3: Parameter study on UAO.
  • Figure 4: Study on contrastive loss ratio $r_{c}$.
  • Figure 5: Qualitative results on different variants of ULAO.