Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight

Fan Liu, Chenwei Dong, Chuanyi Zhang, Hualiang Zhou, Jun Zhou

TL;DR

This work tackles noisy cross-modal correspondence in web-derived image-text data. It introduces qua-partitioning, which splits the data into four types, and a Self-Drop and Dual-Weight (SDD) framework for robustly fine-tuning vision-language models, built on the pair similarities $S(I_i,T_i)$ and the $L_{InfoNCE}$ objective. The method combines sample selection (self-drop) with a dual-weight scheme, consisting of a confidence weight $w_{con}$ from a Gaussian Mixture Model and a significance weight $w_{sig}$ from a memory-bank analysis, to suppress noise while emphasizing significant clean samples. Empirical results on Flickr30K, MS-COCO, and CC120K show that SDD consistently outperforms state-of-the-art methods under various simulated and real-world noise levels and remains stable across noise ratios. The approach offers practical robustness for large-scale, noisy cross-modal data and a blueprint for improving noise resilience in vision-language pre-training (VLP) fine-tuning.

Abstract

Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry challenge associated with cross-modal matching. Although this practice does not require expensive annotations, it inevitably introduces mismatched pairs and results in a noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to distinguish noise and perform re-weighting. However, simply lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence on training. In this paper, we propose a novel self-drop and dual-weight approach, which achieves fine-grained data processing by qua-partitioning the data. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that, for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples and thereby mitigate the impact of noise. In addition, we adopt a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague samples. Compared with prior work, our approach is more robust and more stable on noisy datasets, especially under a high noise ratio. Extensive experiments on three widely used datasets, Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach.
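
As a point of reference for the $L_{InfoNCE}$ objective named in the TL;DR, the commonly used image-to-text InfoNCE loss over a batch of $N$ pairs is written out below. This is the standard form, not a formula quoted from the paper; the temperature $\tau$ is an assumption, and the text-to-image term is obtained by swapping the roles of $I$ and $T$.

$$
L_{InfoNCE}^{i2t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(S(I_i,T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(S(I_i,T_j)/\tau\right)}
$$

One plausible way to apply the dual weights is to scale each surviving pair's loss term by a per-sample weight $w_i$ derived from $w_{con}$ and $w_{sig}$. How the two weights are combined is not stated in this summary, so $w_i$ is left abstract here.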

Paper Structure

This paper contains 33 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustrations of complex data distributions in the real world and our observations. (a) We partition all data into four types: clean and significant (green), clean yet insignificant (light green), vague (grey), and noisy (orange). Previous methods that adopt bi-partition or tri-partition are insufficient to handle such complex data distributions. (b) Comparison of rSum of CLIP under different noise/drop ratios on Flickr30K and MS-COCO 1K. VLP performance is more sensitive to noisy data than to discarded data; therefore, a small number of clean samples is more valuable than a majority of noisy ones for fine-tuning a VLP model.
  • Figure 2: Overview of the proposed method. (1) Self-Drop: SDD computes the similarity for samples with noisy correspondence, then uses a threshold $\alpha$ to construct a partial dataset $D_{p}$ by dropping samples with low similarity. (2) Confidence Weight: a GMM generates $w_{con}$ from the similarity distribution of samples in the partial dataset. (3) Significance Weight: SDD creates a siamese model by copying the parameters $\Theta_b^t$ of the base model. The siamese model, with parameters $\Theta_s^t$, uses a memory bank to evaluate the loss variation before ($l^{i}[t]$) and after ($l^{i}[t+1]$) training on the partial dataset to produce $w_{sig}$. (4) Robust Matching: Finally, the base model trains and updates its parameters on the partial dataset using $w_{con}$ and $w_{sig}$.
  • Figure 3: (a) Performance and variance $(var)$ under different noise ratios on Flickr30K. (b) Performance and variance $(var)$ under different noise ratios on MS-COCO 1K.
  • Figure 4: (a) Performance curves on the validation set of Flickr30K in the training process. (b) Performance curves on the validation set of MS-COCO 1K in the training process.
  • Figure 5: Similarity distributions of SDD and NPC at different training stages on Flickr30K under 40% noise. Because SDD and NPC adopt the same CLIP ViT-B/32-based backbone, their similarity distributions are identical in the initial epoch (a). Although SDD inevitably fits the noise after training for 5 epochs, it still separates clean and noisy samples well. In contrast, NPC's separation of noise after 5 epochs of training (d) is even worse than in epoch 1 (b).
  • ...and 3 more figures
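
Steps (1) and (2) of the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `self_drop` and `gmm_confidence`, the toy similarity values, the threshold value, and the hand-written two-component EM fit are all hypothetical, and the sketch assumes that higher similarity indicates a cleaner pair.

```python
import math


def self_drop(similarities, alpha):
    """Return indices of pairs whose similarity is at least alpha (hypothetical helper)."""
    return [i for i, s in enumerate(similarities) if s >= alpha]


def _normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_confidence(values, iters=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns, for each value, the posterior probability of the component
    with the higher mean, used here as a stand-in for w_con.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    means = [lo, hi]  # initialise the components at the two extremes
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    mix = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in values]
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        for n, x in enumerate(values):
            p = [mix[k] * _normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp[n] = [p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5]
        # M-step: re-estimate mixing weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, min_var)
            mix[k] = nk / len(values)
    clean = 0 if means[0] > means[1] else 1
    return [r[clean] for r in resp]


# Toy usage: made-up similarities S(I_i, T_i) and an arbitrary threshold alpha.
toy_sims = [0.05, 0.10, 0.12, 0.55, 0.60, 0.62, 0.85, 0.88, 0.90, 0.92]
partial_idx = self_drop(toy_sims, alpha=0.3)                 # indices forming D_p
w_con = gmm_confidence([toy_sims[i] for i in partial_idx])   # one weight per kept pair
```

In practice a library mixture model would likely replace the hand-written EM loop; the pure-Python version only keeps the sketch free of dependencies. The significance weight $w_{sig}$ is omitted because this summary does not say how the loss variation is mapped to a weight.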