PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval

Qiang Zou, Shuli Cheng, Jiayi Chen

TL;DR

PromptHash tackles semantic truncation and modal heterogeneity in cross-modal hashing by introducing affinity-prompted learning, an adaptive gated state-space fusion, and a Prompt Affinity Contrastive Learning (PACL) framework. The method uses text affinity prompts to preserve foreground semantics under CLIP's context limits, fuses image and prompt-rich text via a gated State Space Model, and aligns modalities with global-local prompt contrast and affinity-aware losses. Empirical results on MIRFLICKR-25K, NUS-WIDE, and MS COCO show substantial improvements over state-of-the-art methods, including an 18.22% (I2T) and 18.65% (T2I) gain on NUS-WIDE, and strong gains on the other datasets. The work introduces a new paradigm for cross-modal hashing that emphasizes semantic consistency, efficient fusion, and foreground-background discrimination, with publicly available code to enable reproducibility.

Abstract

Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrain retrieval efficacy. We present PromptHash, an end-to-end framework that leverages affinity prompt-aware collaborative learning for adaptive cross-modal hashing, with the following technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that combines a State Space Model with a Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://github.com/ShiShuMo/PromptHash.
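
For readers new to the setting, the retrieval step that cross-modal hashing makes cheap can be sketched as follows. This is a generic illustration of sign binarization and Hamming-distance ranking, not PromptHash's implementation; the function names (binarize, hamming, rank_by_hamming), the 4-bit code length, and the toy feature values are all assumptions made for the example.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the signs of a continuous feature vector into an integer hash code.

    Bit i is set when features[i] >= 0 (a common sign-based binarization;
    assumed here, not taken from the paper).
    """
    code = 0
    for i, value in enumerate(features):
        if value >= 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: List[int]) -> List[int]:
    """Return database indices ordered from closest to farthest hash code."""
    return sorted(range(len(database)), key=lambda idx: hamming(query, database[idx]))


# Toy image-to-text query: one image feature vector against three text vectors.
image_query = binarize([0.9, -0.2, 0.4, -0.7])
text_db = [
    binarize([0.8, -0.1, 0.3, -0.5]),   # same sign pattern as the query
    binarize([-0.6, 0.2, 0.1, -0.9]),   # differs in two positions
    binarize([-0.3, 0.7, -0.2, 0.5]),   # differs in all four positions
]
ranking = rank_by_hamming(image_query, text_db)
print(ranking)  # [0, 1, 2]
```

Because both modalities are mapped to codes of the same length, an image query can be compared against text codes (and vice versa) with XOR and a bit count, which is where the storage and speed benefits mentioned in the abstract come from.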

Paper Structure

This paper contains 23 sections, 18 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison with existing frameworks. (a) Previous methods use dual Transformers for cross-modal hashing contrastive learning. (b) We compare and learn images and texts separately by setting learnable affinity prompts.
  • Figure 2: Overall Framework of PromptHash. Our framework consists of five key components: 1) Image and Text Encoders for modality-specific feature extraction; 2) Adaptive Gated State Selection and Fusion Module for feature filtering and cross-modal fusion between image features and hybrid prompt-enhanced textual features; 3) Text Affinity-Aware Prompting that dynamically learns and distinguishes retrieval-beneficial foreground information while optimizing textual feature representations through dynamic prompting mechanisms; 4) Cross-Modal Prompt Alignment Mechanism incorporating both global and local alignments, where global alignment facilitates image-to-text and image-to-prompt representation alignments with intra-class and inter-class affinity losses; 5) Hash Learning with quantization and reconstruction losses.
  • Figure 3: Ablation study results of six key hyperparameters ($\alpha$, $\beta$, $\gamma$, $\lambda$, $\mu$, $\nu$) evaluated on three benchmark datasets (MIRFLICKR-25K, NUS-WIDE, and MS COCO).
  • Figure 4: Precision-Recall curves of different hash code lengths (16, 32, and 64 bits) on three benchmark datasets: MIRFLICKR-25K, NUS-WIDE, and MS COCO.