Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning

Yang Liu, Xinshuo Wang, Jiale Du, Xinbo Gao, Jungong Han

TL;DR

The paper tackles Compositional Zero-Shot Learning (CZSL) by addressing complex attribute–object interactions and long-tail data with a Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network. It introduces Attribute-Driven Data Synthesis (ADDS) to diversify training attribute combinations and Subclass-Driven Discriminative Embedding (SDDE) to capture fine-grained subclass structure in embeddings, optimized via a combined objective $L_{total} = \alpha L_{base} + \beta L_{emd}$ and a joint feasibility score $C(a,o)$ in a shared space. The approach achieves state-of-the-art results on UT-Zappos, MIT-States, and C-GQA under both closed-world and open-world CZSL settings, with ablations confirming the contributions of ADDS and SDDE and their synergy. The work demonstrates improved generalization to unseen attribute–object compositions and robustness to data imbalance, supported by extensive experiments and qualitative retrieval analyses. Overall, HDA-OE offers a practical, scalable framework for reliable CZSL in diverse, real-world scenarios.
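
The combined objective named in the TL;DR is a weighted sum of two loss terms. The sketch below only illustrates that weighting. The function name, the default weights, and the use of plain floats are assumptions made for illustration; this summary does not define how the base loss and the embedding loss are computed, so they are treated as already-computed scalars.

```python
def total_loss(l_base: float, l_emd: float,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum L_total = alpha * L_base + beta * L_emd.

    l_base and l_emd are assumed to be scalar loss values that were
    computed elsewhere; alpha and beta are the trade-off weights.
    The default weights of 1.0 are placeholders, not values from the paper.
    """
    return alpha * l_base + beta * l_emd


# Hypothetical usage with made-up loss values and weights.
loss = total_loss(l_base=2.0, l_emd=4.0, alpha=0.5, beta=0.25)
print(loss)
```

Keeping the weights as explicit arguments makes it easy to switch one term off (for example `beta=0.0`), which is the kind of comparison an ablation over the two terms would need.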

Abstract

Compositional Zero-Shot Learning (CZSL) recognizes new combinations by learning from known attribute-object pairs. However, the main challenge of this task lies in the complex interactions between attributes and object visual representations, which lead to significant differences in images. In addition, the long-tail label distribution in the real world makes the recognition task more complicated. To address these problems, we propose a novel method, named Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network. To increase the variability of training data, HDA-OE introduces an attribute-driven data synthesis (ADDS) module. ADDS generates new samples with diverse attribute labels by combining multiple attributes of the same object. By expanding the attribute space in the dataset, the model is encouraged to learn and distinguish subtle differences between attributes. To further improve the discriminative ability of the model, HDA-OE introduces the subclass-driven discriminative embedding (SDDE) module, which enhances the subclass discriminative ability of the encoding by embedding subclass information in a fine-grained manner, helping to capture the complex dependencies between attributes and object visual features. The proposed model has been evaluated on three benchmark datasets, and the results verify its effectiveness and reliability.
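
The abstract says ADDS generates new samples with diverse attribute labels by combining multiple attributes of the same object, but it does not say how the combination is carried out. The sketch below is one possible reading, not the paper's procedure: it assumes each sample is a feature vector with one attribute and one object label, and it blends pairs of samples that share an object but differ in attribute, using a fixed mixing coefficient. The function name `synthesize_same_object_mixes`, the coefficient `lam`, and the soft attribute labels are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def synthesize_same_object_mixes(samples, lam=0.5):
    """Blend pairs of samples that share an object but differ in attribute.

    samples: list of (feature_vector, attribute, object) tuples, where
             feature_vector is a list of floats (an assumed representation).
    lam:     assumed mixing coefficient in [0, 1].
    Returns a list of (mixed_features, attribute_weights, object) tuples.
    """
    # Group samples by object so only same-object pairs are combined.
    by_object = defaultdict(list)
    for feat, attr, obj in samples:
        by_object[obj].append((feat, attr))

    synthetic = []
    for obj, items in by_object.items():
        for (f1, a1), (f2, a2) in combinations(items, 2):
            if a1 == a2:
                continue  # nothing new to learn from identical attributes
            mixed = [lam * x + (1.0 - lam) * y for x, y in zip(f1, f2)]
            # Soft label: each source attribute weighted by its mixing share.
            synthetic.append((mixed, {a1: lam, a2: 1.0 - lam}, obj))
    return synthetic


# Hypothetical toy data: two "dog" samples with different attributes
# and a single "car" sample that has no partner to mix with.
toy = [
    ([1.0, 0.0], "wet", "dog"),
    ([0.0, 1.0], "old", "dog"),
    ([1.0, 1.0], "red", "car"),
]
for features, attrs, obj in synthesize_same_object_mixes(toy):
    print(obj, attrs, features)
```

Objects that appear with only one attribute produce no synthetic samples under this reading, so a long-tail dataset would still need some other treatment for its rarest objects.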

Paper Structure

This paper contains 23 sections, 12 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) Traditional methods: Recognize unseen images by learning from known combinations. (b) Our method: Image samples are expanded across multiple layers, after which our visual features are deconstructed and mapped into the corresponding spaces, ultimately converging to form category prototypes. These prototypes serve as a basis for reassembling and predicting new combinations.
  • Figure 2: An overview of the proposed approach. We generate the target database by Attribute-Driven Data Synthesis (ADDS) (as shown in (a)). Then, we decompose the encoded visual features (i.e., $f_{cls}$) into their corresponding attribute and object feature embeddings using a traditional disentanglement architecture. A series of target feature embeddings with enhanced discriminative power will be synthesized through Subclass-Driven Discriminative Embedding (SDDE) (as shown in (b)). Both the target feature embeddings and the original feature embeddings are projected into a shared space to achieve semantic alignment.
  • Figure 3: The impact of the temperature parameter $\tau$ on the best AUC and HM on the C-GQA dataset in the open-world and closed-world settings.
  • Figure 4: Qualitative results. (a) Each image has a ground truth label (black text) and 5 retrieval results (colored text), where the green text is the correct prediction. (b) In the last row, "Felt Slipper", the wrong image (red box) is "fleece Slippers".