Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning

Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali

TL;DR

The paper tackles open-world compositional zero-shot learning by predicting attribute and object primitives independently while explicitly modeling their interactions with self-attention. It projects image and textual representations into separate attribute and object spaces and scores compositions via a multiplicative cosine-based fusion, augmented by ConceptNet-based feasibility to prune infeasible pairs. Empirical results on MIT-States, UT-Zappos, and CGQA demonstrate competitive or state-of-the-art performance, with notable gains on CGQA and robust open-world generalization. The work highlights the value of primitive-level attention and knowledge-guided feasibility for scalable, contextualized visual reasoning in zero-shot settings.
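The scoring step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the image has already been projected into an attribute space and an object space, that text embeddings for each primitive live in the matching space, and that feasibility is a precomputed set of allowed pairs. The names `cosine`, `score_compositions`, and `feasible`, along with the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def score_compositions(img_attr, img_obj, attr_embs, obj_embs, feasible):
    """Score every feasible (attribute, object) pair for one image.

    img_attr / img_obj: image features in the attribute / object space.
    attr_embs / obj_embs: dict mapping a primitive name to its text embedding.
    feasible: set of (attribute, object) pairs kept after pruning; a
              hypothetical stand-in for a ConceptNet-derived filter.
    """
    scores = {}
    for a, a_vec in attr_embs.items():
        s_attr = cosine(img_attr, a_vec)
        for o, o_vec in obj_embs.items():
            if (a, o) not in feasible:
                continue  # infeasible pair is pruned from the test space
            # Multiplicative fusion of the two primitive-level similarities.
            scores[(a, o)] = s_attr * cosine(img_obj, o_vec)
    return scores


# Toy 2-D embeddings (assumed values, for illustration only).
attr_embs = {"red": [1.0, 0.0], "wet": [0.0, 1.0]}
obj_embs = {"cake": [1.0, 0.0], "car": [0.0, 1.0]}
feasible = {("red", "cake"), ("red", "car"), ("wet", "car")}

scores = score_compositions([0.9, 0.1], [0.8, 0.2], attr_embs, obj_embs, feasible)
best = max(scores, key=scores.get)
print(best)
```

Multiplying raw cosines is the simplest reading of "multiplicative cosine-based fusion". Two negative similarities would multiply to a positive score, so a practical system would likely rescale or normalize the similarities before fusing them.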

Abstract

Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs. Predicting compositions unseen during training is a challenging task. In this study, we explore Open World Compositional Zero-Shot Learning (OW-CZSL), where the test space encompasses all possible combinations of attributes and objects. Our approach applies self-attention between attributes and objects to generalize better from seen to unseen compositions, since self-attention helps the model identify relationships between attributes and objects. The similarity between the self-attended textual and visual features is then computed to generate predictions at inference. Because attribute-object pairings are unrestricted, the test space may include implausible compositions. To mitigate this issue, we leverage external knowledge from ConceptNet to restrict the test space to realistic compositions. Our proposed model, Attention-based Simple Primitives (ASP), demonstrates competitive performance, achieving results comparable to the state of the art.
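To make the role of self-attention concrete, here is a toy sketch of how attending between an attribute embedding and an object embedding makes the attribute representation depend on its paired object. It is a simplification under stated assumptions: a single head, identity query/key/value projections (a trained model would learn these), and made-up 4-dimensional embeddings. The function names and vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attend(tokens):
    """Single-head scaled dot-product self-attention.

    Uses identity Q/K/V projections (an assumption made for brevity),
    so each output is a softmax-weighted mix of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(logits)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


# Toy embeddings (assumed values): one attribute, two candidate objects.
attr_old = [1.0, 0.0, 0.0, 0.0]
obj_car = [0.0, 1.0, 0.0, 0.0]
obj_elephant = [0.0, 0.0, 1.0, 0.0]

# The sequence is [attribute, object]; keep the attended attribute token.
old_with_car, _ = self_attend([attr_old, obj_car])
old_with_elephant, _ = self_attend([attr_old, obj_elephant])

print(old_with_car)
print(old_with_elephant)
```

The same attribute vector yields different attended outputs depending on the object it is paired with, which is the kind of contextualization the abstract attributes to self-attention.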

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: In Compositional Zero-Shot Learning, the training set consists of compositions of attributes such as “Red” and “White” and objects such as “Car” and “Cake”. Traditional image recognition models can typically predict only known compositions and struggle to form new ones at test time. In contrast, a compositional zero-shot learning model composes new compositions at test time, as the “Red Cake” example shows.
  • Figure 2: Example demonstrating the difference between the closed-world and open-world settings.
  • Figure 3: Demonstration of visual diversity in primitives. The attribute Old looks drastically different in the context of Elephant (animate object) vs. Car (inanimate object). Similarly, the attribute Wet looks drastically different in the context of Cat (animate object) vs. Ground (inanimate object).
  • Figure 4: Overall architecture of our ASP model. After concatenating attribute and object features, we compute self-attention between attributes and objects to capture the interactions between them. We then take the attribute and object features and project them to the Attribute Space (AS) and Object Space (OS), respectively, through their respective MLPs. Next, we compute the cosine similarity between the attribute image features and the composition attribute features. Similarly, we compute the cosine similarity between the object image features and the composition object features.
  • Figure 5: Effect of the number of heads in multi-head attention. The x-axis shows the number of heads and the y-axis shows the harmonic mean.
  • ...and 1 more figure
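
The Figure 5 caption does not say which quantities the harmonic mean combines. In CZSL evaluation it is commonly the harmonic mean of seen and unseen composition accuracy, and the sketch below assumes that reading. The function name and the example accuracies are hypothetical.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of two accuracies in [0, 1].

    Assumes the two inputs are seen and unseen composition accuracy;
    returns 0.0 when both are zero to avoid division by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# A model that ignores unseen compositions is penalized heavily.
print(harmonic_mean(0.8, 0.0))
print(harmonic_mean(0.5, 0.5))
```

Unlike the arithmetic mean, this metric stays low unless both accuracies are reasonably high, which is why it suits a setting where seen and unseen performance trade off.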