Compositional Zero-Shot Learning: A Survey

Ans Munir; Faisal Z. Qureshi; Mohsen Ali; Muhammad Haris Khan

TL;DR

This survey analyzes Compositional Zero-Shot Learning (CZSL) through a disentanglement-based taxonomy, grouping methods into four categories: no disentanglement, textual, visual, and cross-modal. It reviews problem settings, datasets, evaluation protocols, and a broad spectrum of modeling strategies, including causal, multi-domain, partially supervised, and synthetic-embedding approaches. The empirical analysis indicates that visual disentanglement with CLIP backbones currently yields strong performance, while cross-modal approaches, though promising, require further development to excel in open-world settings. The paper highlights four open challenges (modeling contextuality, open-world scaling, unseen primitives, and efficient use of large multimodal models) and outlines a roadmap for advancing CZSL toward scalable and robust compositional reasoning.

Abstract

Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, "small" cats appear visually distinct from "older" ones, and "wet" cars differ significantly from "wet" cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. The papers studied in this survey, together with their official code, are listed on our GitHub: https://github.com/ans92/Compositional-Zero-Shot-Learning
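
As a rough illustration of the combinatorial setup the abstract describes, the sketch below builds the candidate label spaces for a toy problem. It is not taken from the survey: the attribute and object vocabularies, the seen pairs, and the closed-world unseen pairs are all hypothetical, and the closed-world/open-world split follows the commonly used convention (closed-world restricts test labels to a predefined set of seen and unseen compositions; open-world considers every attribute-object pair).

```python
from itertools import product

# Hypothetical primitive vocabularies (illustrative only).
attributes = ["wet", "small", "old"]
objects = ["cat", "car"]

# Hypothetical compositions that appear in the training data.
seen = {("wet", "car"), ("small", "cat"), ("old", "car")}

# Open-world: every attribute-object pair is a candidate label at test time.
open_world = set(product(attributes, objects))

# Closed-world: the unseen test compositions are assumed to be known in advance,
# so the candidate set is just seen pairs plus this predefined unseen set.
closed_world_unseen = {("wet", "cat"), ("old", "cat")}
closed_world = seen | closed_world_unseen

# Pairs that were never observed during training but must still be scored
# in the open-world setting.
unseen_open = open_world - seen

print(len(open_world), len(closed_world), len(unseen_open))
```

Even in this toy case the open-world label space is larger than the closed-world one, and the gap grows with the product of the two vocabulary sizes, which is one reason open-world scaling is listed among the open challenges.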
