PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval

Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi

TL;DR

This work tackles noisy correspondence learning in cross-modal retrieval by introducing PC$^2$, a framework that leverages a pseudo-classification task, pseudo-captioning for informative supervision of mismatched pairs, and a prediction-oscillation based mechanism to rectify correspondences. The method builds on a co-dividing training strategy to separate clean and noisy data and dynamically adjusts triplet margins to maximize reliable supervision. A new dataset, Noise of Web (NoW), provides a realistic benchmark with natural web-derived noise, enabling robust evaluation beyond synthetic noise. Empirical results show PC$^2$ consistently outperforms margin-based and existing NCL methods on Flickr30K, MS-COCO, and NoW, and the authors release both NoW and the code to foster further progress in robust cross-modal learning.

Abstract

In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy. Firstly, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, we capitalize on PC$^2$'s pseudo-classification capability to generate pseudo-captions that provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification predictions is used to assist the correction of correspondence. In addition to these technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could serve as a powerful new NCL benchmark in which noise exists naturally. Empirical evaluations of PC$^2$ show marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets under various NCL settings. The contributed dataset and source code are released at https://github.com/alipay/PC2-NoiseofWeb.
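
The pseudo-classification idea can be illustrated with a small sketch. It is not the authors' implementation: the linear classifier `weights`, the helper names, and the choice of turning the caption prediction into a hard label via argmax are all assumptions made for illustration. The caption's prediction supplies the class label, and the image's prediction is trained against it with a standard cross-entropy loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a hypothetical linear classifier: one weight row per pseudo-class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def pseudo_classification_loss(weights, image_embs, text_embs):
    """Mean cross-entropy of image predictions against caption-derived labels."""
    total = 0.0
    for img, txt in zip(image_embs, text_embs):
        p = softmax(linear(weights, img))  # image pseudo-prediction
        q = softmax(linear(weights, txt))  # caption pseudo-prediction
        # Assumption: the caption's most likely class is used as a hard label.
        label = max(range(len(q)), key=q.__getitem__)
        total += -math.log(p[label])
    return total / len(image_embs)


# Toy 2-D embeddings and a 2-class identity classifier (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0]]
aligned = pseudo_classification_loss(W, [[2.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = pseudo_classification_loss(W, [[2.0, 0.0]], [[0.0, 1.0]])
print(aligned < mismatched)
```

In this toy setting, a pair whose image and caption fall into the same pseudo-class incurs a lower loss than a mismatched pair, which is the signal the auxiliary task is meant to provide.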

Paper Structure (24 sections, 11 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Illustrations for noisy correspondence and difference between currently popular margin-based methods and proposed pseudo-caption based $\textrm{PC}^{2}$. $\textrm{PC}^{2}$ aims to provide direct supervision for false positive pairs with pseudo-caption and larger margin (i.e., $\alpha_{\textrm{large}}$) in triplet loss, rather than adjusting a smaller margin (i.e., $\alpha_{\textrm{small}}$) to alleviate the negative influence of false positive pairs in margin-based methods.
  • Figure 2: Experimental results of NCR (left) vs. $\textrm{PC}^{2}$ (right). Our method shows a more robust learning performance on clean data, maintaining a gradually converging trend with minimal influence from noisy data. In contrast, NCR exhibits a more oscillating pattern in learning clean data, especially when starting to learn from noisy data, causing noticeable fluctuations in the loss of clean data.
  • Figure 3: Sample data pairs in NoW composed of website pages and their corresponding site meta-descriptions. Boxes with different colors are used to display the region proposals obtained by the detection model APT [gu2023mobile] trained by us.
  • Figure 4: Visualization of the procedures of pseudo-classification and pseudo-captioning in $\textrm{PC}^{2}$. Pseudo-classification: Given a batch of clean data $(I^{\texttt{c}}_{i},T^{\texttt{c}}_{i})$, we first calculate the embeddings of $I^{\texttt{c}}_{i}$ and $T^{\texttt{c}}_{i}$. Then we use $\mathcal{C}$ to obtain their pseudo-predictions $p^{\texttt{c}}_{i}$ and $q^{\texttt{c}}_{i}$, respectively. $q^{\texttt{c}}_{i}$ is used as the classification label to supervise the training of $\mathcal{C}$ on $p^{\texttt{c}}_{i}$ using the standard cross-entropy loss function, in hopes of reinforcing the training of image-text matching. Pseudo-captioning: Given noisy data $(I^{\texttt{n}}_{i},T^{\texttt{n}}_{i})$, we first discard its caption $T^{\texttt{n}}_{i}$. We input the embedding of $I^{\texttt{n}}_{i}$ into $\mathcal{C}$ to obtain its pseudo-prediction $p^{\texttt{n}}_{i}$, then find the most similar one (denoted as $p^{\texttt{c}}_{j}$) to $p^{\texttt{n}}_{i}$ among $p^{\texttt{c}}_{i}$ from the aforementioned batch of clean data being trained synchronously. We assign the corresponding caption of $p^{\texttt{c}}_{j}$ (i.e., $T^{\texttt{c}}_{j}$) to $I^{\texttt{n}}_{i}$ as the pseudo-caption $T^{\texttt{p}}_{i}$, and also utilize a margin based on pseudo-prediction similarity to train the matching model with a triplet ranking loss.
  • Figure 5: Ablation studies on curve parameter $m$ on both Flickr30K and MS-COCO with 40% noise.
  • ...and 2 more figures
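
The pseudo-captioning procedure described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: cosine similarity is assumed as the similarity measure between pseudo-predictions, the margin rule `alpha_max * similarity` is a hypothetical stand-in for the paper's similarity-based margin, and all function names and numeric values are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_pseudo_caption(p_noisy, clean_preds, clean_captions):
    """Pick the caption of the clean sample whose pseudo-prediction is closest."""
    sims = [cosine(p_noisy, p_clean) for p_clean in clean_preds]
    j = max(range(len(sims)), key=sims.__getitem__)
    return clean_captions[j], sims[j]


def triplet_loss(sim_pos, sim_neg, margin):
    """Standard hinge-based triplet ranking loss on similarity scores."""
    return max(0.0, margin - sim_pos + sim_neg)


# Pseudo-prediction of a noisy image whose original caption was discarded.
p_noisy = [0.7, 0.2, 0.1]
# Pseudo-predictions and captions of the clean batch (toy values).
clean_preds = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
clean_captions = ["a dog on grass", "a red car"]

caption, similarity = assign_pseudo_caption(p_noisy, clean_preds, clean_captions)

# Hypothetical margin rule: a more similar pseudo-prediction earns a larger margin.
alpha_max = 0.2
margin = alpha_max * similarity
loss = triplet_loss(sim_pos=0.5, sim_neg=0.4, margin=margin)
print(caption, round(similarity, 3), loss > 0.0)
```

Tying the margin to the pseudo-prediction similarity means a confidently matched pseudo-caption is pushed apart from negatives more strongly than a weakly matched one, which mirrors the contrast between $\alpha_{\textrm{large}}$ and $\alpha_{\textrm{small}}$ drawn in Figure 1.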