REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Ruochen Zheng, Jiahao Hong, Changxin Gao, Nong Sang

TL;DR

This work tackles noisy correspondence in cross-modal matching by introducing REPAIR, a memory-bank–driven framework. It replaces reliance on potentially flawed similarity networks with memory-based rank correlation to estimate soft correspondence labels, and adds a Noisy Pair Half-Replacing strategy to salvage fully mismatched pairs by substituting one modality with a memory-informed surrogate. Empirical results on Flickr30K, MS-COCO, and CC152K demonstrate robust gains under synthetic and real noise, with competitive memory-time tradeoffs. The approach offers a general, pluggable solution to reduce error propagation and maximize data utilization in multimodal learning.

Abstract

The presence of noise in acquired data invariably leads to performance degradation in cross-modal matching. Unfortunately, obtaining precise annotations in the multimodal field is expensive, which has prompted some methods to tackle the problem of mismatched data pairs in cross-modal matching, termed noisy correspondence. However, most existing noisy correspondence methods exhibit two limitations: a) self-reinforcing error accumulation, and b) improper handling of noisy data pairs. To tackle these two problems, we propose a generalized framework termed Rank corrElation and noisy Pair hAlf-replacing wIth memoRy (REPAIR), which benefits from maintaining a memory bank of features of matched pairs. Specifically, for each modality we calculate the distances between the features in the memory bank and the corresponding feature of the target pair, and use the rank correlation of these two sets of distances to estimate the soft correspondence label of the target pair. Estimating soft correspondence from memory bank features rather than from a similarity network avoids the accumulation of errors caused by incorrect network predictions. For pairs that are completely mismatched, REPAIR searches the memory bank for the best-matching feature to replace the feature of one modality, instead of using the original pair directly or simply discarding the mismatched pair. We conduct experiments on three cross-modal datasets, i.e., Flickr30K, MS-COCO, and CC152K, demonstrating the effectiveness and robustness of REPAIR on synthetic and real-world noise.
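The rank-correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of Spearman correlation, the Euclidean distance, and the names `rank_positions`, `soft_correspondence`, `bank_img`, and `bank_txt` are all assumptions made for illustration, since the abstract only specifies "rank correlation" over two sets of distances.

```python
import numpy as np


def rank_positions(values):
    """Return the 0-based rank of each entry (ties are broken arbitrarily)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def soft_correspondence(img_feat, txt_feat, bank_img, bank_txt):
    """Estimate a soft correspondence score for one (image, text) pair.

    bank_img[i] and bank_txt[i] are assumed to be features of the i-th
    matched pair stored in the memory bank. Euclidean distance and
    Spearman correlation are assumptions of this sketch.
    """
    # Distances from the target features to every bank entry, per modality.
    d_img = np.linalg.norm(bank_img - img_feat, axis=1)
    d_txt = np.linalg.norm(bank_txt - txt_feat, axis=1)
    # Spearman correlation = Pearson correlation of the rank vectors.
    r_img = rank_positions(d_img)
    r_txt = rank_positions(d_txt)
    return float(np.corrcoef(r_img, r_txt)[0, 1])


# Toy 1-D memory bank with four matched pairs.
bank_img = np.array([[0.0], [1.0], [2.0], [3.0]])
bank_txt = np.array([[0.0], [1.0], [2.0], [3.0]])

# Both modalities rank the bank entries identically: score close to 1.
matched = soft_correspondence(np.array([-1.0]), np.array([-1.0]), bank_img, bank_txt)
# The two modalities rank the bank entries in opposite order: score close to -1.
mismatched = soft_correspondence(np.array([-1.0]), np.array([4.0]), bank_img, bank_txt)
```

The score depends only on the memory bank and the two target features, not on a learned similarity network, which is the property the abstract credits with avoiding self-reinforcing errors. How the raw correlation is mapped to a label or margin is not stated in the abstract and is left out here.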

Paper Structure (19 sections, 13 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: The issues with the existing methods. (a) The problem of self-reinforcing error accumulation. When a mismatched data pair is mistakenly assigned a high soft correspondence, which is then transformed into a stricter soft margin, errors accumulate. (b) Performance trend on the validation set during the training of NCR [huang2021learning]. Notably, after the 20th epoch, a loss associated with the noisy set is introduced, leading to a rapid decline in performance, especially at a noise rate of 0.6.
  • Figure 2: (a) The training pipeline for REPAIR. Rank correlation and noisy pair half-replacing are abbreviated as RC and NPR, respectively. (b) Illustration of the memory bank update for the clean set and of the rank correlation used to obtain the soft correspondence label.
  • Figure 3: The illustration of the NPR, showing an example to replace the image modality.
  • Figure 4: The performance on Flickr30K with varying bank size. The experiments are conducted under 40% noise rate.
  • Figure 5: The accuracy, precision, and recall on Flickr30K with different values of $\eta$. The experiments are conducted under a 40% noise rate.
  • ...and 4 more figures
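
The half-replacing idea illustrated in Figure 3 (replacing the image modality of a mismatched pair) can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes the text is kept, that the nearest bank text is found by cosine similarity, and that the image stored alongside that bank text serves as the replacement. The function name `half_replace_image` and the toy arrays are hypothetical.

```python
import numpy as np


def half_replace_image(txt_feat, bank_img, bank_txt):
    """Pick a surrogate image feature for a text whose own image is mismatched.

    Assumption of this sketch: bank_img[i] and bank_txt[i] belong to the
    same matched pair, and the surrogate is the image paired with the bank
    text most similar (by cosine similarity) to the kept text.
    """
    # L2-normalise so the dot product equals cosine similarity.
    bank_unit = bank_txt / np.linalg.norm(bank_txt, axis=1, keepdims=True)
    query_unit = txt_feat / np.linalg.norm(txt_feat)
    best = int(np.argmax(bank_unit @ query_unit))
    return bank_img[best], best


# Toy memory bank with two matched pairs.
bank_txt = np.array([[1.0, 0.0], [0.0, 1.0]])
bank_img = np.array([[10.0, 0.0], [0.0, 10.0]])

# The kept text is closest to the second bank text, so the second bank
# image is returned as the surrogate.
surrogate, index = half_replace_image(np.array([0.1, 0.9]), bank_img, bank_txt)
```

The resulting (surrogate image, original text) pair can then be used for training in place of the mismatched pair, rather than discarding it; replacing the text modality instead would follow the same pattern with the roles of the two banks swapped.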