Unified Framework for Open-World Compositional Zero-shot Learning

Hirunima Jayasekara, Khoi Pham, Nirat Saini, Abhinav Shrivastava

TL;DR

Open-World Compositional Zero-shot Learning requires recognizing novel attribute-object compositions beyond the training data. The paper proposes a unified framework that strengthens cross-modal interactions by fusing image and language representations through a transformer, aided by a TopK Embedding Selection module and a Sparse Linear Compositor to enable efficient inference. It adopts a hybrid learning strategy that integrates both joint and independent composition learning. On standard OW-CZSL benchmarks, it achieves state-of-the-art performance on three datasets and surpasses Large Vision Language Models on two, illustrating strong generalization and practical efficiency.

Abstract

Open-World Compositional Zero-Shot Learning (OW-CZSL) addresses the challenge of recognizing novel compositions of known primitives and entities. Although prior works use language knowledge for recognition, they exhibit limited interaction between the language and image modalities. Our approach focuses primarily on fostering richer interactions between image and textual data. Additionally, we introduce a novel module that alleviates the computational burden of exhaustively exploring all possible compositions at inference time. While previous methods learn compositions either jointly or independently, we introduce a hybrid procedure that leverages both learning mechanisms to generate final predictions. Our proposed model achieves state-of-the-art results in OW-CZSL on three datasets, while surpassing Large Vision Language Models (LLVM) on two datasets.

Paper Structure

This paper contains 10 sections, 1 figure.

Figures (1)

  • Figure 1: The overall architecture of the proposed method. Input embeddings to the transformer encoder are formed by concatenating image patch embeddings and text embeddings. The TopK Embedding Selection Module uses cross-attention to select the text embeddings most relevant to the input image. The Sparse Linear Compositor uses a sparse linear layer to compute attribute and object predictions along with the final prediction vector.
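
The selection step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that image patch embeddings act as queries and text embeddings act as keys, and that a text embedding's relevance is the softmax attention weight it receives, averaged over patches. The function name `topk_text_embeddings` and the toy vectors are hypothetical, and the scoring rule is an assumption made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_text_embeddings(patches, text_embs, k):
    """Rank text embeddings by the attention they receive from image patches.

    Assumption (not from the paper): relevance is the scaled dot-product
    attention weight, averaged over all patches. Returns the indices of the
    k most relevant text embeddings and the full relevance list.
    """
    scale = 1.0 / math.sqrt(len(text_embs[0]))
    relevance = [0.0] * len(text_embs)
    for patch in patches:
        # One row of cross-attention: this patch attends over all text embeddings.
        weights = softmax([dot(patch, t) * scale for t in text_embs])
        for i, w in enumerate(weights):
            relevance[i] += w / len(patches)
    ranked = sorted(range(len(text_embs)), key=lambda i: relevance[i], reverse=True)
    return ranked[:k], relevance


# Toy example: two image patches and four candidate text embeddings.
patches = [[1.0, 0.0], [0.9, 0.1]]
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
selected, relevance = topk_text_embeddings(patches, text_embs, k=2)
print(selected)
```

Only the selected embeddings would then be passed on to later stages, which is how a selection step of this kind can reduce the number of candidates considered at inference time.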