Compositional Zero-Shot Learning: A Survey

Ans Munir, Faisal Z. Qureshi, Mohsen Ali, Muhammad Haris Khan

TL;DR

This survey analyzes Compositional Zero-Shot Learning (CZSL) through a disentanglement-based taxonomy, grouping methods into four categories: no explicit disentanglement, textual, visual, and cross-modal. It reviews problem settings, datasets, evaluation protocols, and a broad spectrum of modeling strategies, including causal, multi-domain, partial-supervision, and synthetic-embedding approaches. The empirical analysis reveals that visual disentanglement with CLIP backbones currently yields strong performance, while cross-modal approaches, though promising, require further development to excel in open-world settings. The paper highlights open challenges (modeling contextuality, open-world scaling, unseen primitives, and efficient use of large multimodal models) and outlines a roadmap for advancing CZSL toward scalable and robust compositional reasoning.

Abstract

Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, "small" cats appear visually distinct from "older" ones, and "wet" cars differ significantly from "wet" cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review the state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. Papers studied in this survey, together with their official code, are listed in our GitHub repository: https://github.com/ans92/Compositional-Zero-Shot-Learning

Paper Structure

This paper contains 48 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Concept diagram of Compositional Zero-Shot Learning. A traditional image recognition model cannot generalize learned concepts, whereas a CZSL model generalizes attributes and objects to recognize unseen compositions of seen concepts at test time. (b) The importance of contextuality in CZSL: the attribute Small appears differently when paired with different objects (Small Plane vs. Small Cat), and the object Car appears differently with different attributes (Wet Car vs. Broken Car).
  • Figure 2: Statistics of published work on CZSL in selected venues. (a) Yearly breakdown of CZSL papers. Publications surged in 2023 and 2024: 2023 saw a 116% increase over 2022, followed by a 46% rise in 2024 over 2023. The 2025 count covers only papers released during the first half of the year. (b) CZSL papers published from 2017 to 2025, by venue.
  • Figure 3: Difference between the closed-world and open-world settings in Compositional Zero-Shot Learning. The closed-world setting is further divided into only-unseen and generalized closed-world. Compositions marked with exclamation marks are infeasible.
  • Figure 4: A Comprehensive Taxonomy of Compositional Zero-Shot Learning Methods. Our taxonomy organizes CZSL methods along two dimensions. At the first level, methods are grouped by their disentanglement strategy, meaning how they factorize primitive representations across modalities. No Explicit Disentanglement retains a unified representation of the whole composition without modular separation; Textual Feature Disentanglement isolates attribute and object semantics within the language space; Visual Feature Disentanglement explicitly separates attribute-related and object-related components in the visual feature space; and Cross-Modal (Hybrid) Disentanglement factorizes both modalities jointly, aligning visual and textual primitives in a structured way. At the second level, methods are further categorized by how they approach the CZSL challenge. Representative papers are listed under each category.
  • Figure 5: Architectural Overview: No Explicit Disentanglement bypasses primitive separation and models entire attribute-object compositions holistically through unified embeddings or simple fusion mechanisms (dotted boxes and dotted lines in the figure). Textual Disentanglement operates in the language space (gray boxes on the right), learning independent embeddings of attributes and objects that can be semantically composed at inference time. Visual Disentanglement focuses on the visual space (gray boxes on the left), isolating attributes and objects into structured, discriminative representations that can be systematically recombined into unseen compositions. Cross-Modal Disentanglement integrates both vision and language, disentangling primitives within each modality and aligning them in a shared embedding space to leverage complementary cues for more robust generalization.
  • ...and 1 more figure
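
The settings named in the Figure 3 caption differ mainly in which compositions a model must consider at test time. The following is a minimal sketch of those candidate label spaces, not code from the survey: the vocabulary, the seen/unseen split, and the `label_space` helper are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy vocabulary; not taken from any CZSL benchmark.
attributes = ["wet", "small", "broken"]
objects = ["car", "cat", "plane"]

# Assumed split: compositions with training images (seen) and
# held-out compositions that appear only at test time (unseen).
seen = {("wet", "car"), ("small", "cat"), ("broken", "car"), ("small", "plane")}
unseen = {("wet", "cat"), ("broken", "plane")}


def label_space(setting):
    """Return the candidate compositions a model scores at test time."""
    if setting == "only_unseen":
        # Closed-world, only-unseen: candidates are the known unseen pairs.
        return set(unseen)
    if setting == "generalized_closed_world":
        # Generalized closed-world: seen and known unseen pairs together.
        return seen | unseen
    if setting == "open_world":
        # Open-world: every attribute-object pair, including infeasible ones.
        return set(product(attributes, objects))
    raise ValueError(f"unknown setting: {setting}")


for name in ("only_unseen", "generalized_closed_world", "open_world"):
    print(name, len(label_space(name)))
```

Even in this toy case the open-world space contains pairs such as ("wet", "plane") or ("broken", "cat") that belong to neither split. The search space grows with the product of the attribute and object vocabularies, which is consistent with the summary's point that open-world scaling remains an open challenge.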