Cross-modal Active Complementary Learning with Self-refining Correspondence

Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu

TL;DR

This work tackles noisy correspondence (NC), i.e., mismatched image-text training pairs, in cross-modal matching by introducing Cross-modal Robust Complementary Learning (CRCL), which combines an Active Complementary Loss (ACL) with Self-refining Correspondence Correction (SCC). ACL is a noise-tolerant objective that pairs an active learning loss with a robust complementary learning loss, while SCC progressively refines cross-modal correspondences using momentum correction across multiple self-refining processes. The authors prove robustness as $q\to1$ (at $q=1$, $\mathcal{L}_r$ is noise tolerant under uniform NC; see Lemma 1) and show empirically that CRCL achieves state-of-the-art robustness to synthetic and real-world NC on Flickr30K, MS-COCO, and CC152K. It also performs well on well-annotated data and compares favorably with CLIP baselines under noise. The approach makes cross-modal retrieval more reliable in practical settings with noisy annotations and can be applied on top of existing image-text matching models.

Abstract

Recently, image-text matching, which is fundamental to understanding the latent correspondence across visual and textual modalities, has attracted more and more attention from academia and industry. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.
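
The momentum correction mentioned in the abstract can be pictured as an exponential moving average over per-pair soft correspondence labels. The sketch below is only an illustration under assumptions: the function name `momentum_correct`, the coefficient `beta`, its value, and the toy predictions are all hypothetical and not taken from the paper, and the multiple self-refining processes of SCC are not modeled.

```python
import numpy as np

def momentum_correct(prev_labels, predictions, beta=0.8):
    """Blend the previous soft labels with fresh predictions (moving average).

    `beta` is a hypothetical momentum coefficient chosen for illustration.
    """
    prev_labels = np.asarray(prev_labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return beta * prev_labels + (1.0 - beta) * predictions

# Three training pairs, all initially trusted as matched (label 1.0).
labels = np.array([1.0, 1.0, 1.0])

# Toy per-epoch matching scores in [0, 1]; pair 1 behaves like a mismatched pair.
per_epoch_predictions = [
    np.array([0.90, 0.20, 0.80]),
    np.array([0.95, 0.10, 0.85]),
    np.array([0.90, 0.15, 0.90]),
]

for preds in per_epoch_predictions:
    labels = momentum_correct(labels, preds)

print(labels)
```

Because each update keeps most of the previous label, a single unreliable prediction moves the label only slightly, which is one plausible reading of how momentum limits error accumulation.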

Paper Structure (31 sections, 2 theorems, 29 equations, 4 figures, 7 tables, 1 algorithm)

This paper contains 31 sections, 2 theorems, 29 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

In an instance-level cross-modal matching problem, under uniform NC with noise rate $\eta \leq \frac{N-1}{N}$, when $q=1$, $\mathcal{L}_r$ is noise tolerant.
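
To make "noise tolerant" concrete, the sketch below checks numerically the standard argument for symmetric losses under uniform label noise. It does not implement the paper's $\mathcal{L}_r$, whose definition is not given here. It substitutes mean absolute error (MAE), a well-known loss whose sum over all $N$ candidate labels is a constant $C = 2(N-1)$. For such a loss, the expected loss under uniform noise with rate $\eta$ is an affine function of the clean loss, with scale $1 - \frac{\eta N}{N-1}$, which is positive when $\eta < \frac{N-1}{N}$, so the minimizer is unchanged. The values $N=100$ and $\eta=0.2$ mirror the setting of Figure 2. All function names are illustrative.

```python
import numpy as np

def mae_loss(probs, label):
    """Sum of absolute errors between a probability vector and a one-hot label."""
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return np.abs(probs - one_hot).sum()

def expected_noisy_loss(probs, label, eta):
    """Exact expected loss when the label flips uniformly to another class with rate eta."""
    n = probs.shape[0]
    clean = mae_loss(probs, label)
    flipped = sum(mae_loss(probs, k) for k in range(n) if k != label)
    return (1.0 - eta) * clean + eta / (n - 1) * flipped

def affine_form(probs, label, eta):
    """The same quantity written as scale * clean_loss + constant."""
    n = probs.shape[0]
    c = 2.0 * (n - 1)                  # MAE summed over all n candidate labels
    scale = 1.0 - eta * n / (n - 1)    # positive iff eta < (n - 1) / n
    return scale * mae_loss(probs, label) + eta * c / (n - 1), scale

rng = np.random.default_rng(0)
n, eta, label = 100, 0.2, 3
probs = rng.dirichlet(np.ones(n))      # a random prediction over n candidates

noisy = expected_noisy_loss(probs, label, eta)
predicted, scale = affine_form(probs, label, eta)
total = sum(mae_loss(probs, k) for k in range(n))

print(np.isclose(noisy, predicted), scale > 0)
```

Since the noisy objective is a positively scaled and shifted copy of the clean one, both are minimized by the same predictions. That is the sense of noise tolerance used in this sketch.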

Figures (4)

  • Figure 1: (a,d) The performance on Flickr30K and MS-COCO with varying noise rates; (b,c/e,f) The similarities and corrected correspondences of training pairs after learning without/with SCC.
  • Figure 2: How the value of $C/C^\prime$ varies with $q$, for $N=100$ and $\eta=0.2$.
  • Figure 3: Parametric analysis on Flickr30K with 60% noise.
  • Figure 4: The performance of VSE$\infty$ with different loss functions.

Theorems & Definitions (6)

  • Definition 1
  • Lemma 1
  • proof
  • proof
  • Lemma 2
  • proof