Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval

Likang Peng, Chao Su, Wenyuan Wu, Yuan Sun, Dezhong Peng, Xi Peng, Xu Wang

TL;DR

This work tackles cross-modal retrieval under noisy multi-label supervision by introducing Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). SCBCH combines Cross-modal Semantic-Consistent Classification ($L_{cscc}$), which adaptively reweights supervision based on cross-modal neighbor semantics, with Bidirectional Soft Contrastive Hashing ($L_{bsch}$), which uses soft contrastive pairs derived from partial label overlap via a bidirectional strategy. Training proceeds with an epoch-dependent schedule that first emphasizes standard classification losses and then incorporates $L_{cscc}$ alongside $L_{bsch}$ for robust supervision. Across four benchmarks and multiple noise levels, SCBCH consistently outperforms 12 state-of-the-art baselines, demonstrating improved robustness and accuracy in noisy multi-label cross-modal retrieval.

Abstract

Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.
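To make the "compact binary representations" above concrete, the following is a minimal sketch of how hashing-based retrieval is commonly carried out: real-valued features are binarized with the sign function, and database items are ranked by Hamming distance to the query code. It is a generic illustration, not SCBCH itself; the function names (`binarize`, `hamming_distances`, `rank_by_hamming`) and the toy feature values are assumptions made for this example.

```python
import numpy as np


def binarize(features: np.ndarray) -> np.ndarray:
    """Map real-valued features to {-1, +1} hash codes via the sign function."""
    return np.where(features >= 0, 1, -1).astype(np.int8)


def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one code and every row of db_codes.

    For +/-1 codes of length K, distance = (K - <q, b>) / 2.
    """
    k = query_code.shape[0]
    dots = db_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (k - dots) // 2


def rank_by_hamming(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Return database indices sorted from nearest to farthest."""
    return np.argsort(hamming_distances(query_code, db_codes), kind="stable")


# Toy example (hypothetical values): one image query against three text items.
image_feature = np.array([0.3, -1.2, 0.8, -0.1])
text_features = np.array([
    [0.5, -0.7, 0.2, -0.9],
    [-0.4, 0.6, -0.3, 0.2],
    [0.1, -0.2, -0.5, -0.3],
])

query = binarize(image_feature)
database = binarize(text_features)
ranking = rank_by_hamming(query, database)
```

Because the codes are binary, the comparison reduces to integer arithmetic over short vectors, which is where the efficiency of cross-modal hashing comes from.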

Paper Structure

This paper contains 21 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Cross-modal (image-text) pair with noisy multi-label annotations, where green indicates correct labels and red indicates incorrect ones.
  • Figure 2: The SCBCH framework for cross-modal hashing under noisy multi-label supervision consists of CSCC and BSCH. CSCC enhances label reliability by leveraging cross-modal neighbor consistency to adaptively reweight samples, while BSCH employs bidirectional contrastive learning to construct reliable soft pairs based on multi-label overlap, explicitly attracting similar pairs and repelling dissimilar ones to improve robustness against noise.
  • Figure 3: Illustration of three label-based contrastive pairing strategies in multi-label learning. Each row represents the multi-hot label vector of a sample, with the first row corresponding to the anchor sample. A blue circle indicates that a label is present (value one). Green boxes denote positive pairs with the anchor, orange boxes denote negative pairs, and half-green, half-orange boxes represent soft pairs. Taking the last sample as an example, the ALL strategy treats it as a negative pair and the ANY strategy treats it as a positive pair, while the Bidirectional strategy treats it as a soft pair because of its partial label overlap.
  • Figure 4: Precision-recall curves on four datasets with 64-bit hash codes and a noise rate of 0.8.
  • Figure 5: Heatmaps of label similarities and cross-modal feature similarities among 16 randomly selected samples on the MIRFlickr-25K dataset under a 50% noise rate. Subfigure (a) shows the label similarity matrix. Subfigures (b) and (c) illustrate the evolution of feature similarities from epoch 1 to epoch 50.
  • ...and 1 more figure
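
The three pairing strategies described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: "ALL" is taken to mean that two samples are positive only when their label sets are identical, "ANY" that they are positive when they share at least one label, and the soft pair weight is approximated by the Jaccard overlap of the two label sets. The exact bidirectional weighting used by BSCH is not given in this excerpt, and the function name `pair_targets` is hypothetical.

```python
import numpy as np


def pair_targets(labels: np.ndarray, strategy: str) -> np.ndarray:
    """Build an (n, n) matrix of pair targets from multi-hot labels.

    strategy:
      "all"  -> 1 only if the two label sets are identical (assumed reading)
      "any"  -> 1 if the two samples share at least one label (assumed reading)
      "soft" -> Jaccard overlap in [0, 1] (stand-in for the paper's soft pairs)
    """
    labels = labels.astype(np.float64)
    inter = labels @ labels.T                      # number of shared labels
    counts = labels.sum(axis=1)                    # labels per sample
    union = counts[:, None] + counts[None, :] - inter

    if strategy == "all":
        return ((inter == union) & (union > 0)).astype(np.float64)
    if strategy == "any":
        return (inter > 0).astype(np.float64)
    if strategy == "soft":
        # Guard against division by zero for samples with no labels.
        return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Row 0 is the anchor; the last row overlaps with it only partially.
labels = np.array([
    [1, 1, 0, 0],   # anchor
    [1, 1, 0, 0],   # identical labels
    [0, 0, 1, 1],   # disjoint labels
    [1, 0, 1, 0],   # partial overlap
])

all_row = pair_targets(labels, "all")[0]
any_row = pair_targets(labels, "any")[0]
soft_row = pair_targets(labels, "soft")[0]
```

With these toy labels, the partially overlapping sample is a negative under "all", a positive under "any", and receives an intermediate weight of 1/3 under "soft", mirroring the behavior the caption describes for the last sample.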