Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight

Fan Liu; Chenwei Dong; Chuanyi Zhang; Hualiang Zhou; Jun Zhou

TL;DR

This work tackles noisy cross-modal correspondence in web-derived image-text data. It partitions the data into four types (qua-partitioning) and proposes a Self-Drop and Dual-Weight (SDD) framework for robustly fine-tuning vision-language models, built on the pairwise similarities $S(I_i,T_i)$ and the $L_{InfoNCE}$ objective. The method combines sample selection (self-drop) with a dual-weight scheme, a confidence weight $w_{con}$ from a Gaussian Mixture Model and a significance weight $w_{sig}$ from a memory-bank analysis, to suppress noise while emphasizing significant clean samples. Experiments on Flickr30K, MS-COCO, and CC120K show that SDD consistently outperforms state-of-the-art methods under various simulated and real-world noise levels and remains stable across noise ratios. The approach offers practical robustness for large-scale, noisy cross-modal data and a blueprint for improving noise resilience in VLP fine-tuning.
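As a rough illustration of how a confidence weight such as $w_{con}$ can be obtained from a Gaussian Mixture Model, the sketch below fits a two-component 1-D mixture to per-pair losses and uses the posterior of the low-loss component as the weight. This is a common recipe in noisy-label learning, not the paper's confirmed procedure: the choice of per-pair loss as the input, the EM initialisation, and the function names `fit_two_component_gmm` and `confidence_weights` are all assumptions made for the example.

```python
import math


def _normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, iters=50):
    """Fit a 1-D two-component Gaussian mixture with plain EM (hypothetical helper)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                               # start the components at the extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    priors = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        resp = []
        for v in values:
            dens = [priors[k] * _normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300            # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, variances, and mixing priors
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)        # variance floor for numerical stability
            priors[k] = nk / len(values)
    return means, variances, priors


def confidence_weights(losses, iters=50):
    """Return, per pair, the posterior probability of the low-mean ("clean") component."""
    means, variances, priors = fit_two_component_gmm(losses, iters)
    clean = 0 if means[0] <= means[1] else 1       # assumption: clean pairs have lower loss
    weights = []
    for v in losses:
        dens = [priors[k] * _normal_pdf(v, means[k], variances[k]) for k in range(2)]
        total = sum(dens)
        weights.append(dens[clean] / total if total > 0 else 0.0)
    return weights
```

Under these assumptions, pairs whose loss falls in the low-loss mode receive a weight near 1 and pairs in the high-loss mode receive a weight near 0, which is what makes such a weight usable both for re-weighting and as a signal for discarding likely mismatches.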

Abstract

Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry nature of cross-modal matching. Although this practice avoids expensive annotation, it inevitably introduces mismatched pairs and leads to the noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to identify noise and re-weight samples. However, merely lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence during training. In this paper, we propose a novel self-drop and dual-weight approach, which processes data at a finer granularity by qua-partitioning it. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that, for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples and thereby mitigate the impact of noise. In addition, we adopt a dual-weight strategy so that the model focuses more on significant samples while still making appropriate use of vague samples. Compared to prior work, our approach is more robust and more stable on noisy datasets, especially under high noise ratios. Extensive experiments on three widely used datasets, Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach.
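To make the combination of self-drop and dual weighting concrete, here is a minimal sketch of a weighted image-to-text InfoNCE loss in which low-confidence pairs are removed from the batch and the remaining pairs are scaled by two weights. Several details are assumptions for illustration and are not stated in the abstract: the drop threshold `drop_below`, the product `w_con * w_sig` as the way the two weights are combined, the temperature value, and the function names themselves.

```python
import math


def info_nce_row(sim_row, pos_index, temperature=0.07):
    """InfoNCE for one query: negative log-softmax of the positive pair's similarity."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)                                   # subtract the max for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[pos_index]


def weighted_info_nce(sim, w_con, w_sig, drop_below=0.5, temperature=0.07):
    """Weighted InfoNCE over the pairs that survive self-drop.

    sim[i][j] is the similarity between image i and text j; pair (i, i) is the
    nominal positive. Returns (loss, indices_of_kept_pairs).
    """
    # Self-drop (assumed rule): discard pairs whose confidence is below the threshold.
    kept = [i for i in range(len(sim)) if w_con[i] >= drop_below]
    if not kept:
        return 0.0, kept
    total, weight_sum = 0.0, 0.0
    for row_pos, i in enumerate(kept):
        # Dropped pairs are also excluded as negatives in this sketch.
        row = [sim[i][j] for j in kept]
        w = w_con[i] * w_sig[i]                       # assumed way of combining the two weights
        total += w * info_nce_row(row, row_pos, temperature)
        weight_sum += w
    return (total / weight_sum if weight_sum > 0 else 0.0), kept
```

Dropping a pair removes its gradient contribution entirely, whereas a small weight only shrinks it; that difference is the motivation the abstract gives for preferring self-drop over re-weighting alone for noisy pairs.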
