ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning

Quanxing Zha, Xin Liu, Shu-Juan Peng, Yiu-ming Cheung, Xing Xu, Nannan Wang

TL;DR

This paper tackles noisy correspondence in cross-modal retrieval with ReCon, a relation-consistency learning framework that jointly enforces cross-modal and intra-modal relation alignment to distinguish true matches from mismatches. Its dual-constraint objective combines a cross-modal InfoNCE-like loss with an intra-modal relation consistency loss. Data division relies on small-loss signals, Gaussian Mixture Models, and a true-positive identification mechanism based on a proxy relation discrepancy. After a warmup phase, the method partitions the data into clean, locally associated, and noisy sets, each with its own training objective, and pseudo-labels the noisy samples to minimize wrong supervision. Experiments on Flickr30K, MS-COCO, and CC152K show that ReCon consistently outperforms state-of-the-art baselines under both simulated and real-world noise and remains robust across noise levels.
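
The small-loss and Gaussian Mixture Model step mentioned above can be illustrated with a generic sketch. This is not the authors' code: it fits a two-component 1-D mixture to per-sample losses with plain EM and treats the lower-mean component as "clean". The function name `clean_probabilities`, the 0.5 threshold, and the toy losses are assumptions made for illustration, and ReCon's three-way split (clean, locally associated, noisy) is not reproduced here.

```python
import math

def clean_probabilities(losses, iters=50):
    """Fit a 1-D two-component Gaussian mixture with EM and return, for each
    loss, the posterior probability of the lower-mean ("clean") component."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]                                  # start at the extremes
    variances = [max((hi - lo) ** 2 / 4.0, 1e-6)] * 2
    weights = [0.5, 0.5]

    def posterior(x):
        dens = [
            weights[k]
            * math.exp(-((x - means[k]) ** 2) / (2.0 * variances[k]))
            / math.sqrt(2.0 * math.pi * variances[k])
            for k in range(2)
        ]
        total = sum(dens)
        if total == 0.0:
            # Both densities underflowed: fall back to the nearer mean.
            near = 0 if abs(x - means[0]) <= abs(x - means[1]) else 1
            return [1.0 - near, float(near)]
        return [d / total for d in dens]

    for _ in range(iters):
        resp = [posterior(x) for x in losses]         # E-step
        for k in range(2):                            # M-step
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk,
                1e-6,                                 # variance floor
            )

    clean = 0 if means[0] <= means[1] else 1
    return [posterior(x)[clean] for x in losses]

# Toy per-sample losses (hypothetical): four small, four large.
losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9, 2.2]
probs = clean_probabilities(losses)
clean_idx = [i for i, p in enumerate(probs) if p > 0.5]  # 0.5 is an assumed threshold
```

In practice a library mixture model would normally replace the hand-written EM loop; the stdlib version is used only to keep the sketch self-contained.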

Abstract

Can we accurately identify the true correspondences in multimodal datasets that contain mismatched data pairs? Existing methods primarily emphasize similarity matching between the representations of objects across modalities, potentially neglecting the relation consistency within modalities, which is particularly important for distinguishing true from false correspondences. Such an omission risks misidentifying negatives as positives and thus leads to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among multimodal data and thus effectively mitigate the adverse impact of mismatches. Specifically, ReCon leverages a novel relation consistency learning scheme to ensure a dual alignment: cross-modal relation consistency between different modalities and intra-modal relation consistency within each modality. Thanks to these dual constraints on relations, ReCon significantly improves true correspondence discrimination and therefore reliably filters out mismatched pairs, mitigating the risk of wrong supervision. Extensive experiments on three widely used benchmark datasets, Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the effectiveness and superiority of ReCon over other state-of-the-art methods. The code is available at: https://github.com/qxzha/ReCon.
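
To make the dual alignment concrete, here is a minimal sketch under stated assumptions; it is not the formulation from the paper. The cross-modal term is a standard symmetric InfoNCE over cosine similarities. The intra-modal term is assumed here to be the mean squared gap between each sample's image-image and text-text similarity rows. The function names, the temperature `tau=0.07`, and the toy embeddings are all hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_matrix(a, b):
    """Pairwise cosine similarities between two lists of vectors."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE; (img[i], txt[i]) is the assumed positive pair."""
    sim = cosine_matrix(img, txt)
    n = len(sim)

    def one_direction(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / tau for s in row]
            m = max(logits)                           # stable log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]
        return loss / n

    cols = [list(c) for c in zip(*sim)]               # text-to-image direction
    return 0.5 * (one_direction(sim) + one_direction(cols))

def intra_modal_discrepancy(img, txt):
    """Per-sample mean squared gap between image-image and text-text
    similarity rows (an assumed stand-in for a relation consistency term)."""
    s_img = cosine_matrix(img, img)
    s_txt = cosine_matrix(txt, txt)
    n = len(img)
    return [
        sum((s_img[i][j] - s_txt[i][j]) ** 2 for j in range(n)) / n
        for i in range(n)
    ]

# Toy 2-D embeddings (hypothetical): the third caption is swapped for a mismatch.
img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt_matched = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt_mismatched = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]

disc_matched = intra_modal_discrepancy(img, txt_matched)
disc_mismatched = intra_modal_discrepancy(img, txt_mismatched)
```

In this toy batch the mismatched caption changes how the third sample relates to its neighbours in text space but not in image space, so its discrepancy becomes nonzero. That is the kind of signal a relation-aware criterion can use in addition to the direct image-text similarity.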

Paper Structure

This paper contains 23 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of relation discrepancy. Relation-aware alignment correctly identifies the mismatched pair as a negative, while relation-agnostic alignment fails to detect the inconsistency.
  • Figure 2: The schematic pipeline of the proposed ReCon learning framework.
  • Figure 3: Performance of ReCon under different hyper-parameter settings on Flickr30K with 40% noisy correspondences (NCs).
  • Figure 4: Examples of detected mismatched pairs on Flickr30K.