
CSCNET: Class-Specified Cascaded Network for Compositional Zero-Shot Learning

Yanyi Zhang, Qi Jia, Xin Fan, Yu Liu, Ran He

TL;DR

Compositional Zero-shot Learning (CZSL) requires recognizing novel attribute–object (A–O) compositions from primitives learned in known compositions. CSCNet addresses this with class-specified cascaded Attribute-to-Object (A2O) and Object-to-Attribute (O2A) branches, a separate composition branch, and a Parametric Classifier (ParamCls) that improves visual–semantic matching, so that contextual A–O dependencies are modeled more effectively. The framework achieves state-of-the-art results on MIT-States and C-GQA, with significant gains in AUC and strong performance on Seen/Unseen splits, while carefully handling unseen compositions in the composition branch. Inference combines primitive and composition cues via Score = $\beta\left(P(a|x)P(o|\bar{a},x) + P(o|x)P(a|\bar{o},x)\right) + (1-\beta)P(c|x)$, where $\beta$ weights the two cascaded primitive branches against the composition branch.
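The inference score above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `composition_scores`, the array layout (attributes along rows, objects along columns), and the toy probabilities are all assumptions made for illustration, and the branch outputs are taken as given rather than produced by a network.

```python
import numpy as np

def composition_scores(p_a, p_o_given_a, p_o, p_a_given_o, p_c, beta):
    """Combine primitive and composition cues into one score per (a, o) pair.

    Hypothetical layout (not from the paper):
      p_a          (A,)   P(a|x), attribute branch
      p_o_given_a  (O,)   P(o|a_bar, x), A2O cascaded branch
      p_o          (O,)   P(o|x), object branch
      p_a_given_o  (A,)   P(a|o_bar, x), O2A cascaded branch
      p_c          (A, O) P(c|x), composition branch
    """
    a2o = np.outer(p_a, p_o_given_a)   # P(a|x) * P(o|a_bar, x)
    o2a = np.outer(p_a_given_o, p_o)   # P(a|o_bar, x) * P(o|x)
    return beta * (a2o + o2a) + (1.0 - beta) * p_c

# Toy example with 2 attributes and 2 objects (made-up numbers).
p_a = np.array([0.7, 0.3])
p_o_given_a = np.array([0.2, 0.8])
p_o = np.array([0.4, 0.6])
p_a_given_o = np.array([0.9, 0.1])
p_c = np.array([[0.1, 0.6],
                [0.2, 0.1]])

scores = composition_scores(p_a, p_o_given_a, p_o, p_a_given_o, p_c, beta=0.5)

# Index of the best-scoring (attribute, object) pair.
best_a, best_o = (int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

With `beta=1.0` the score depends only on the two cascaded branches, and with `beta=0.0` only on the composition branch, which matches the role of $\beta$ as a trade-off weight in the formula.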

Abstract

Attribute and object (A-O) disentanglement is a fundamental and critical problem for Compositional Zero-shot Learning (CZSL), whose aim is to recognize novel A-O compositions based on prior knowledge. Existing methods based on disentangled representation learning lose sight of the contextual dependency between the A-O primitive pairs. Motivated by this, we propose a novel A-O disentangled framework for CZSL, namely Class-specified Cascaded Network (CSCNet). The key insight is to first classify one primitive and then specify the predicted class as a prior for guiding the recognition of the other primitive in a cascaded fashion. To this end, CSCNet constructs Attribute-to-Object and Object-to-Attribute cascaded branches, in addition to a composition branch modeling the two primitives as a whole. Notably, we devise a parametric classifier (ParamCls) to improve the matching between visual and semantic embeddings. By improving the A-O disentanglement, our framework achieves results superior to those of previous competitive methods.

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Top: concept illustration of CZSL, where a novel composition "Ripe Apple" recomposes attribute and object primitives learned from known compositions "Ripe Banana" and "Green Apple". Bottom: comparison of A-O disentanglement in the baseline method and in our proposed CSCNet. Note that we omit the composition branch for brevity.
  • Figure 2: Architecture of our Class-specified Cascaded Network (CSCNet) for CZSL.
  • Figure 3: Impact of $\beta$ on AUC for the two datasets.
  • Figure 4: Qualitative Results on MIT-States (Top) and C-GQA (Bottom). We present Top-3 prediction candidates using CSCNet, where green indicates correct predictions and yellow indicates incorrect predictions.