Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels

Ruitao Pu, Yuan Sun, Yang Qin, Zhenwen Ren, Xiaomin Song, Huiming Zheng, Dezhong Peng

TL;DR

This paper tackles cross-modal retrieval with noisy labels by proposing Robust Self-paced Hashing with Noisy Labels (RSHNL), which integrates three components: Contrastive Hashing Learning (CHL) to tighten cross-modal consistency, Center Aggregation Learning (CAL) to unify class-level hash centers and reduce intra-class variation, and Noise-tolerance Self-paced Hashing (NSH) to dynamically identify and downweight mislabeled pairs while training from easy to hard. The approach formulates a joint objective that alternates between center-based regularization and cross-modal alignment, guided by a self-paced regularizer that yields weights $w_i=\max(0,1-\ell_i/\gamma)$. Theoretical analysis explains how NSH separates clean from noisy data via a tunable pace parameter $\gamma$, and experiments across four large datasets show RSHNL outperforms 11 baselines under varying noise rates and hash lengths, indicating strong robustness and practical value. Overall, RSHNL offers a principled and effective framework for robust cross-modal hashing in the presence of noisy supervision.
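
The weighting rule $w_i=\max(0,1-\ell_i/\gamma)$ mentioned above can be illustrated with a minimal sketch. It assumes per-pair losses are already available as a NumPy array. The function name `self_paced_weights` and the example values are hypothetical and are not taken from the paper's code. Pairs with loss at or above $\gamma$ get zero weight, and low-loss pairs keep a weight close to 1.

```python
import numpy as np

def self_paced_weights(losses, gamma):
    """Linear soft weights w_i = max(0, 1 - loss_i / gamma).

    `losses` holds one non-negative loss per training pair and `gamma`
    is the pace parameter (both names are illustrative assumptions).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    losses = np.asarray(losses, dtype=float)
    return np.maximum(0.0, 1.0 - losses / gamma)

# Hypothetical per-pair losses: small values are "easy" pairs, large values
# are "hard" pairs that may carry noisy labels.
losses = np.array([0.0, 0.5, 1.0, 3.0])
weights = self_paced_weights(losses, gamma=2.0)

# Weighted training loss: the pair with loss 3.0 >= gamma contributes nothing.
weighted_loss = float(np.sum(weights * losses))
```

Increasing `gamma` over training admits progressively harder pairs, which is the usual easy-to-hard schedule in self-paced learning. How the paper schedules $\gamma$ is not stated in this summary.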

Abstract

Cross-modal hashing (CMH) has emerged as a popular technique for cross-modal retrieval due to its low storage cost and high computational efficiency on large-scale data. Most existing methods implicitly assume that multi-modal data is correctly labeled, an assumption that is expensive and often unattainable to satisfy because imperfect annotations (i.e., noisy labels) are inevitable in real-world scenarios. Inspired by human cognitive learning, a few methods introduce self-paced learning (SPL) to gradually train the model from easy to hard samples, which is often used to mitigate the effects of feature noise or outliers. How to use SPL to alleviate the misleading effect of noisy labels on the hash model, however, remains largely unexplored. To tackle this problem, we propose a new cognitive cross-modal retrieval method called Robust Self-paced Hashing with Noisy Labels (RSHNL), which mimics the human cognitive process to identify noise while remaining robust against noisy labels. Specifically, we first propose a contrastive hashing learning (CHL) scheme to improve multi-modal consistency, thereby reducing the inherent semantic gap. Afterward, we propose center aggregation learning (CAL) to mitigate intra-class variations. Finally, we propose Noise-tolerance Self-paced Hashing (NSH), which dynamically estimates the learning difficulty of each instance and distinguishes noisy labels by their difficulty level. For all estimated clean pairs, we further adopt a self-paced regularizer to gradually learn hash codes from easy to hard. Extensive experiments demonstrate that the proposed RSHNL outperforms state-of-the-art CMH methods.

Paper Structure (20 sections, 14 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The framework of our RSHNL. Blue and red represent hash codes of different modalities, triangles and rectangles represent different categories, and the doji markers represent hash centers. Specifically, CHL maximizes the consistency of multi-modal data to alleviate the cross-modal gap. CAL develops a unified hash code for each class as a center and encourages intra-class hash codes to be compact around their corresponding hash centers. NSH dynamically distinguishes noisy labels based on their estimated learning difficulty while learning hash codes from easy to hard for clean pairs.
  • Figure 2: Experimental results with 128 bits on the INRIA-Websearch dataset under 0.6 noise rate.
  • Figure 3: The density versus the weight of all instances with 128 bits and 0.6 noise rate.
  • Figure 4: The average MAP scores versus epochs.