HOPE: A Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture of Experts

Do Huu Dat, Po Yuan Mao, Tien Hoang Nguyen, Wray Buntine, Mohammed Bennamoun

TL;DR

Compositional Zero-Shot Learning (CZSL) seeks to recognize unseen state–object compositions by leveraging compositional primitives. HOPE integrates a Modern Hopfield Network memory for retrieval with a Soft Mixture of Experts and a Soft Prompt Module to assemble primitive representations into novel concepts, reinforced by contrastive and retrieval losses. The approach yields state-of-the-art results on MIT-States and UT-Zappos and provides detailed ablations on memory design, contrastive learning, and expert routing, illustrating the importance of memory quality and gating in open-world CZSL. This memory-based, composition-aware framework offers a scalable path toward robust CZSL in open-world settings.

Abstract

Compositional Zero-Shot Learning (CZSL) has emerged as an essential paradigm in machine learning, aiming to overcome the constraints of traditional zero-shot learning by incorporating compositional thinking into its methodology. Conventional zero-shot learning has difficulty managing unfamiliar combinations of seen and unseen classes because it depends on pre-defined class embeddings. In contrast, CZSL leverages the inherent hierarchies and structural connections among classes, creating new class representations by combining attributes, components, or other semantic elements. In this paper, we propose a novel framework that, for the first time, combines the Modern Hopfield Network with a Mixture of Experts (HOPE) to classify the compositions of previously unseen objects. Specifically, the Modern Hopfield Network creates a memory that stores label prototypes and identifies relevant labels for a given input image. Subsequently, the Mixture of Experts model integrates the image with the appropriate prototype to produce the final composition classification. Our approach achieves state-of-the-art performance on several benchmarks, including MIT-States and UT-Zappos. We also examine how each component contributes to improved generalization.
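The memory retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the standard modern Hopfield update, in which a query is replaced by a softmax-weighted sum of stored patterns. The function name `hopfield_retrieve`, the inverse temperature `beta`, and the toy three-dimensional prototypes are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, memory, beta=8.0):
    """One modern-Hopfield update (hypothetical helper, not the authors' code).

    query  : list of floats, e.g. an image feature
    memory : list of stored patterns, e.g. label prototypes
    beta   : inverse temperature; larger values retrieve more sharply
    """
    # Similarity of the query to every stored pattern, scaled by beta.
    scores = [beta * sum(q * m for q, m in zip(query, pattern))
              for pattern in memory]
    weights = softmax(scores)
    # Retrieved pattern: attention-weighted sum of the stored patterns.
    retrieved = [sum(w * pattern[d] for w, pattern in zip(weights, memory))
                 for d in range(len(query))]
    return retrieved, weights


# Toy memory of three prototypes (assumed values, for illustration only).
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
retrieved, weights = hopfield_retrieve(query, memory)
```

In this toy case the query lies closest to the first prototype, so nearly all of the retrieval weight falls on it. With a smaller `beta` the retrieved vector becomes a smoother blend of several prototypes.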

Paper Structure (19 sections, 10 equations, 4 figures, 6 tables)

This paper contains 19 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Given a new unseen input composition (wet cat), HOPE retrieves analogous visual and linguistic concept embeddings from its learnable memory (wet dog, wet clothes, dry cat, cute cat, etc.) and uses the retrieved information to make the final prediction.
  • Figure 2: Framework illustration of HOPE. HOPE is trained in three stages. (a) We train the soft prompt by calculating $L_{\text{spm}}$ between CLIP encodings. The soft prompt consists of the trainable variables [V1][V2][V3] and the set of labels [cut][dog]. (b) We then forward the image latent features to the Hopfield network to retrieve the nearest samples from each class, and introduce $L_{\text{InfoNCE}}$ and $L_r$ to optimize the visual memory. (c) Finally, the Soft Mixture of Experts aggregates the information and gives the final prediction. The prediction is trained with $L_{\text{dfm}}$ and $L_{\text{st+obj}}$.
  • Figure 3: Visualization of attribute memory embeddings under different memory configurations and datasets.
  • Figure 4: Visualization of the Soft Mixture of Experts. Each expert focuses on a distinct category of data (Expert 3 on food items, Expert 4 on apparel, etc.), demonstrating the model's ability to assign and weight inputs across different neural network sub-models for greater specialization and classification accuracy.
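The soft routing shown in Figures 2(c) and 4 can be sketched as follows. This is a minimal sketch that assumes the generic Soft Mixture of Experts formulation, where each expert slot receives a softmax-weighted mix of all tokens and each token receives a softmax-weighted mix of all expert outputs. The captions do not give HOPE's exact layer, so the function `soft_moe`, the slot parameters `phi`, the use of one slot per expert, and the toy experts are all assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def soft_moe(tokens, phi, experts):
    """Generic Soft MoE layer (hypothetical sketch, one slot per expert).

    tokens  : n x d input features
    phi     : d x s learnable slot parameters (fixed here for illustration)
    experts : list of s callables, each mapping a d-vector to a d-vector
    """
    logits = matmul(tokens, phi)                      # n x s
    # Dispatch weights: softmax over tokens, computed separately per slot.
    dispatch_t = [softmax(col) for col in transpose(logits)]  # s x n
    # Combine weights: softmax over slots, computed separately per token.
    combine = [softmax(row) for row in logits]        # n x s
    slot_inputs = matmul(dispatch_t, tokens)          # s x d
    slot_outputs = [expert(slot)
                    for expert, slot in zip(experts, slot_inputs)]
    return matmul(combine, slot_outputs), combine     # n x d, n x s


# Toy inputs (assumed values): three tokens, two experts.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
phi = [[2.0, -2.0], [-2.0, 2.0]]
experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
output, combine = soft_moe(tokens, phi, experts)
```

Because both weightings are soft, every token contributes to every expert and the layer stays fully differentiable. The per-token rows of `combine` are the kind of expert weights that a visualization like Figure 4 would display.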