Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning

Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang

TL;DR

Duplex tackles compositional zero-shot learning by introducing dual prototypes: semantic prototypes derived from learnable prompts in the CLIP text space, and visual prototypes learned from disentangled state and object features. A graph-based visual prototype updater (GCN) refines the composition prototypes using related visual evidence, and a multi-path inference strategy combines semantic, visual, and object-state cues for robust prediction in both closed- and open-world CZSL. The method achieves state-of-the-art results on MIT-States, UT-Zappos, and CGQA, with gains in both accuracy and generalization, and it demonstrates the complementary value of aligning textual and visual representations while explicitly modeling visual disentanglement. Overall, Duplex advances CZSL by combining pre-trained vision-language models, prompt-based semantic alignment, and graph-based visual prototype updates to generalize to unseen state-object compositions.
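To make the multi-path idea concrete, here is a minimal sketch of how scores from several paths could be fused at inference time. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine`, `multi_path_scores`, `predict`), the equal default weights, and the simple additive fusion rule are all hypothetical, and the toy 2-D vectors stand in for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def multi_path_scores(image_feat, state_feat, object_feat,
                      semantic_protos, visual_protos,
                      state_protos, object_protos,
                      weights=(1.0, 1.0, 1.0)):
    """Score every (state, object) composition by summing three paths.

    ASSUMPTION: additive fusion with hand-set weights; the actual fusion
    rule used by Duplex is not specified in this summary.
    """
    w_sem, w_vis, w_prim = weights
    scores = {}
    for comp in semantic_protos:
        state, obj = comp
        # Path 1: image feature vs. semantic (text-derived) composition prototype.
        score = w_sem * cosine(image_feat, semantic_protos[comp])
        # Path 2: image feature vs. visual composition prototype.
        score += w_vis * cosine(image_feat, visual_protos[comp])
        # Path 3: disentangled state/object features vs. their own prototypes.
        score += w_prim * (cosine(state_feat, state_protos[state])
                           + cosine(object_feat, object_protos[obj]))
        scores[comp] = score
    return scores


def predict(scores):
    """Return the composition with the highest fused score."""
    return max(scores, key=scores.get)


# Toy usage with hypothetical 2-D features.
semantic_protos = {("wet", "dog"): [1.0, 0.0], ("dry", "dog"): [0.0, 1.0]}
visual_protos = {("wet", "dog"): [0.9, 0.1], ("dry", "dog"): [0.1, 0.9]}
state_protos = {"wet": [1.0, 0.0], "dry": [0.0, 1.0]}
object_protos = {"dog": [0.0, 1.0]}

toy_scores = multi_path_scores(
    image_feat=[1.0, 0.0], state_feat=[1.0, 0.0], object_feat=[0.0, 1.0],
    semantic_protos=semantic_protos, visual_protos=visual_protos,
    state_protos=state_protos, object_protos=object_protos,
)
print(predict(toy_scores))  # ('wet', 'dog')
```

In the open-world setting the candidate set would cover all state-object pairs rather than a curated list, so the same scoring loop would simply run over a larger dictionary of prototypes.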

Abstract

Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. Existing methods predominantly focus on learning semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, thereby limiting their ability to generalize to unseen compositions. To address this challenge, we propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, enabling effective representation learning for compositional tasks. Duplex utilizes a Graph Neural Network (GNN) to adaptively update visual prototypes, capturing complex interactions between states and objects. Additionally, it leverages the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs) and employs a multi-path architecture combined with prompt engineering to align image and text representations, ensuring robust generalization. Extensive experiments on three benchmark datasets demonstrate that Duplex outperforms state-of-the-art methods in both closed-world and open-world settings.
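The graph-based prototype update mentioned above can be illustrated with a single graph-convolution step. The sketch below assumes a standard symmetric-normalized GCN layer and a hypothetical graph in which each composition node is linked to its state node and its object node; neither the graph construction nor the layer design is confirmed by this summary, and the helper names are illustrative only.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix A."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(A_norm @ feats @ weight)."""
    propagated = matmul(normalized_adjacency(adj), feats)
    projected = matmul(propagated, weight)
    return [[max(0.0, x) for x in row] for row in projected]


# Hypothetical 3-node graph: 0 = state "wet", 1 = object "dog",
# 2 = unseen composition "wet dog" with a zero-initialized prototype.
adj = [[0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]

updated = gcn_layer(adj, feats, identity)
print(updated[2])  # the composition node now mixes state and object features
```

With an identity weight matrix, the zero-initialized composition node receives equal contributions from its state and object neighbors, which is the basic mechanism by which evidence from seen primitives could reach an unseen composition.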

Paper Structure (38 sections, 14 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Motivation of our method. Augmenting semantic prototypes with visual prototypes enables a more comprehensive representation of compositions.
  • Figure 2: Overview of our proposed Duplex. The Duplex framework consists of two parts: the semantic prototype module and the visual prototype module. The semantic prototype module extracts linguistic features, while the visual prototype module extracts image features shared by the same state or object from seen composition categories and generalizes them to unseen composition categories.
  • Figure 3: Embedding visualization. The first 300 classes of the MIT-States dataset were selected for clustering, with colors representing different categories. Visualization was performed separately for (a) the semantic prototype embedding, (b) the visual prototype embedding, and (c) the joint semantic and visual prototype embedding.
  • Figure 4: Qualitative results. We randomly selected cases from MIT-States (top row), UT-Zappos (middle row), and CGQA (bottom row). Each image is shown with its ground-truth label (black text) and three predicted results (colored text), where green text marks the correct prediction.
  • Figure 5: Semantic and Visual Prototype Retrieval. We conduct retrieval using semantic and visual prototypes extracted via Duplex on three datasets. In the first row, the incorrect result (highlighted in red) is 'barren road'. In the second row, the incorrect result is 'canvas loafers'. In the third row, the incorrect result is 'brown giraffe'.
  • ...and 4 more figures