Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning

Zihua Zhao, Mengxi Chen, Tianjie Dai, Jiangchao Yao, Bo Han, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles noisy cross-modal correspondence that distorts both cross-modal and intra-modal geometrical structures. It introduces Geometrical Structure Consistency (GSC), a method that simultaneously preserves cross-modal similarities and intra-modal structures, using two noise-robust indicators $y_{\text{CM}}$ and $y_{\text{IM}}$ to infer true correspondences and purify the training losses. Through a purified contrastive cross-modal loss and a purified intra-modal loss, GSC leverages early memorization to establish stable geometry and then refines representations with a temporal ensembling scheme. Empirical results on four benchmarks, including CC152K, show that GSC consistently outperforms state-of-the-art noisy-correspondence methods and remains robust across varying noise levels and real-world data, with practical impact for multimodal retrieval systems.

Abstract

Noisy correspondence, which refers to mismatches in cross-modal data pairs, is prevalent in human-annotated or web-crawled datasets. Prior approaches to leveraging such data mainly apply uni-modal noisy-label learning without addressing the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. We find that both structures, once well established, are effective for discriminating noisy correspondence through structural differences. Inspired by this observation, we introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence. Specifically, GSC ensures the preservation of geometrical structures within and between modalities, allowing for the accurate discrimination of noisy samples based on structural differences. Utilizing these inferred true correspondence labels, GSC refines the learning of geometrical structures by filtering out the noisy samples. Experiments across four cross-modal datasets confirm that GSC effectively identifies noisy samples and significantly outperforms the current leading methods.
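
The idea of discriminating noisy pairs through structural differences can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: cosine similarity as the measure, a "similarity profile" (one sample's similarities to the other samples of its own modality) as the intra-modal structure, and the hypothetical names `cosine` and `structure_scores`. The toy embeddings are made up, with the last pair deliberately mismatched.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def structure_scores(image_embs, text_embs):
    """Return (cross_modal, intra_modal) scores for each pair in a batch.

    Assumption for illustration only: these are not the paper's definitions.
    """
    n = len(image_embs)
    cross_modal, intra_modal = [], []
    for i in range(n):
        # Cross-modal: similarity between image i and its paired text i.
        cross_modal.append(cosine(image_embs[i], text_embs[i]))
        # Intra-modal: compare how image i relates to the other images
        # with how text i relates to the other texts.
        img_profile = [cosine(image_embs[i], image_embs[j]) for j in range(n) if j != i]
        txt_profile = [cosine(text_embs[i], text_embs[j]) for j in range(n) if j != i]
        intra_modal.append(cosine(img_profile, txt_profile))
    return cross_modal, intra_modal

# Toy batch of four image-text pairs; the last pair (index 3) is mismatched.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.1, 1.0]]

cross_modal, intra_modal = structure_scores(image_embs, text_embs)
for i, (cm, im) in enumerate(zip(cross_modal, intra_modal)):
    print(f"pair {i}: cross-modal={cm:.3f}, intra-modal={im:.3f}")
```

In this toy batch the mismatched pair receives the lowest score on both measures. Its presence also lowers the intra-modal scores of the clean pairs it neighbours, which hints at why the structures need to be well established before they are used for discrimination.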

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Noisy correspondence impacts both cross-modal and intra-modal geometrical structures. Left: The cross-modal distance between a mismatched text and image is initially large but is wrongly reduced. Right: The intra-modal structures of the mismatched image (above) and text (below) are initially distinct but are wrongly aligned, so similar samples within each modality are pulled apart.
  • Figure 2: Geometrical Structure Consistency helps discriminate samples with noisy correspondence. The model is first trained on the clean Flickr30K dataset, then evaluated on the same dataset with 40% simulated noise. Left: Calculated cross-modal similarity scores of both clean and noisy samples. Right: Calculated intra-modal similarity scores of both clean and noisy samples.
  • Figure 3: An overview of GSC. Left: The framework of GSC. GSC first extracts image and text representations through separate encoders, then simultaneously optimizes cross-modal and intra-modal objectives to preserve geometrical structure consistency. GSC leverages both structures to discriminate noisy samples and estimate the true correspondence indicator $y$, which can be further utilized to purify the overall learning. Right: GSC discriminates noisy samples by structural differences from both cross-modal and intra-modal aspects.
  • Figure 4: Analysis of different hyper-parameter combinations on Flickr30K with 40% noise. Left: $\gamma$ is the balancing parameter between $\mathcal{L}_{\text{CM}}$ and $\mathcal{L}_{\text{IM}}$. Right: $\beta_1$ and $\beta_2$ are separate momentums for the cross-modal and intra-modal temporal ensembling.
  • Figure 5: (a) The changing values of clean and noisy sample weights when the noise rate is 20%, 40%, and 60%. (b) Distribution of intra-modal geometrical similarity, including PDFs of clean and noisy pair similarities and estimated Gaussian distribution components. (c) Cross-modal weight distributions of GSC on clean and noisy pairs. (d) Intra-modal weight distributions of GSC on clean and noisy pairs. Experiments from (b) to (d) are conducted on Flickr30K with a noise rate of 0.4.
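
The hyper-parameters named in the Figure 4 caption can be made concrete with a small sketch. This section does not state the update rule or the form of the combined objective, so the following are assumptions for illustration: temporal ensembling as an exponential moving average with momentum $\beta$, and the overall loss as $\mathcal{L}_{\text{CM}} + \gamma \mathcal{L}_{\text{IM}}$. The function names and all numeric values are hypothetical.

```python
def ensemble_update(previous, current, beta):
    """Exponential moving average of per-sample indicator estimates.

    Assumption: a larger beta keeps more of the previous estimate.
    """
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]

def combined_loss(loss_cm, loss_im, gamma):
    """Assumed form of the overall objective: L_CM + gamma * L_IM."""
    return loss_cm + gamma * loss_im

# Hypothetical per-epoch indicator estimates for two samples:
# column 0 behaves like a clean pair, column 1 like a noisy pair.
raw_estimates = [
    [0.90, 0.10],
    [0.95, 0.20],
    [0.20, 0.15],  # a single unreliable estimate for the clean pair
    [0.90, 0.05],
]

beta_1 = 0.7  # hypothetical momentum, not a value reported in the paper
smoothed = raw_estimates[0]
for estimate in raw_estimates[1:]:
    smoothed = ensemble_update(smoothed, estimate, beta_1)

print("smoothed indicators:", smoothed)
print("combined loss:", combined_loss(1.0, 0.5, gamma=0.2))
```

The moving average keeps one unreliable epoch from flipping a clean pair's indicator, which is the usual motivation for temporal ensembling. Under this reading, separate momentums $\beta_1$ and $\beta_2$ would correspond to calling `ensemble_update` with a different `beta` for the cross-modal and intra-modal indicators.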