Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning
Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang
TL;DR
Duplex tackles compositional zero-shot learning (CZSL) with dual prototypes: semantic prototypes derived from learnable prompts in the CLIP text space, and visual prototypes learned from disentangled state and object features. A graph-based visual prototype updater (a GCN) refines the composition prototypes using related visual evidence, and a multi-path inference strategy combines semantic, visual, and object-state cues for prediction in both closed-world and open-world CZSL. The method reports state-of-the-art results on MIT-States, UT-Zappos, and CGQA, and it shows that aligning textual and visual representations is complementary to explicitly modeling visual disentanglement. In short, Duplex combines a pre-trained vision-language model, prompt-based semantic alignment, and graph-based visual prototype updates to generalize to unseen state-object compositions.
Abstract
Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. Existing methods predominantly focus on learning semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, thereby limiting their ability to generalize to unseen compositions. To address this challenge, we propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, enabling effective representation learning for compositional tasks. Duplex utilizes a Graph Neural Network (GNN) to adaptively update visual prototypes, capturing complex interactions between states and objects. Additionally, it leverages the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs) and employs a multi-path architecture combined with prompt engineering to align image and text representations, ensuring robust generalization. Extensive experiments on three benchmark datasets demonstrate that Duplex outperforms state-of-the-art methods in both closed-world and open-world settings.
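The dual-prototype idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a single graph-convolution layer with random weights, an adjacency rule that links compositions sharing a state or an object, and a hypothetical mixing weight `alpha` between the semantic and visual paths. The separate state and object paths are omitted, and random vectors stand in for CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def gcn_update(prototypes, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # symmetric degree normalization
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ prototypes @ weight, 0.0)


def dual_prototype_scores(image_feat, semantic_protos, visual_protos, alpha=0.5):
    """Mix cosine similarities to semantic and visual prototypes (alpha is an assumption)."""
    img = l2_normalize(image_feat)
    sem = l2_normalize(semantic_protos) @ img
    vis = l2_normalize(visual_protos) @ img
    return alpha * sem + (1.0 - alpha) * vis


# Toy setup: 2 states x 2 objects gives 4 compositions, with 8-dimensional features.
rng = np.random.default_rng(0)
pairs = [(s, o) for s in range(2) for o in range(2)]
dim = 8

# Assumed graph: two compositions are neighbors if they share a state or an object.
adjacency = np.array(
    [[float(i != j and (p[0] == q[0] or p[1] == q[1])) for j, q in enumerate(pairs)]
     for i, p in enumerate(pairs)]
)

semantic_protos = rng.normal(size=(len(pairs), dim))  # stand-in for prompt embeddings
visual_protos = rng.normal(size=(len(pairs), dim))    # stand-in for learned visual prototypes
weight = rng.normal(size=(dim, dim)) / np.sqrt(dim)   # untrained layer weights

updated_visual = gcn_update(visual_protos, adjacency, weight)
image_feat = rng.normal(size=dim)                     # stand-in for an image embedding
scores = dual_prototype_scores(image_feat, semantic_protos, updated_visual)
predicted_pair = pairs[int(np.argmax(scores))]
print("predicted (state, object) indices:", predicted_pair)
```

In this sketch the graph step lets each composition prototype borrow evidence from compositions that share one of its primitives, which is one plausible reading of how prototypes for unseen compositions could be informed by seen ones.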
