Disentangled Noisy Correspondence Learning

Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang

TL;DR

DisNCL tackles noisy cross-modal correspondences in image-text retrieval with an information-theoretic disentanglement framework that separates modality-invariant information (MII) from modality-exclusive information (MEI). By training in the MII subspace and employing softened, many-to-many cross-modal targets, the method achieves robust similarity predictions and mitigates MEI noise. The approach combines variational MI estimators, adversarial objectives, and a final joint loss that includes a regularizer to enforce disentanglement. It yields state-of-the-art performance on synthetic and real-world noisy benchmarks, and MI reduction and visualization results provide evidence of effective disentanglement. These results suggest that DisNCL could improve real-world multi-modal retrieval systems where noisy alignments are prevalent and could inspire further theory-driven disentanglement in cross-modal learning.
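The summary does not say which variational MI estimators DisNCL uses. As a generic illustration only, the sketch below computes the InfoNCE lower bound, one widely used variational estimator, which satisfies I(A; B) >= log N - L_NCE for a batch of N paired features. The function names, the dot-product critic, and the temperature value are assumptions made for this example and are not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_lower_bound(feats_a, feats_b, temperature=0.1):
    """Estimate a lower bound on I(A; B) from N paired feature vectors.

    Row i of feats_a is assumed to be paired with row i of feats_b.
    The critic (scaled dot product) and temperature are illustrative choices.
    """
    n = len(feats_a)
    loss = 0.0
    for i in range(n):
        logits = [dot(feats_a[i], feats_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy of picking the true partner i among the n candidates.
        loss += log_denominator - logits[i]
    loss /= n
    # The bound can never exceed log(n), so large batches are needed
    # to certify large mutual information.
    return math.log(n) - loss


if __name__ == "__main__":
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(info_nce_lower_bound(aligned, aligned))
```

Minimizing mutual information between subspaces usually calls for an upper bound rather than a lower bound, so an estimator of this kind would cover only the "maximize shared information" side of a disentanglement objective.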

Abstract

Cross-modal retrieval is crucial for understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical because real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises because MEI is not shared across modalities, so aligning it during training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiably optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategies for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model the noisy many-to-many relationships inherent in multi-modal inputs for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy, with an average recall improvement of 2%. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
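The abstract mentions soft matching targets for many-to-many relationships but does not give their formula here. The sketch below shows one simple way such targets could be built, assuming a square image-text similarity matrix whose diagonal holds the annotated pairs: each row's target mixes the annotated one-hot label with a softmax over the predicted similarities. The mixing weight `alpha`, the temperature, and all function names are hypothetical and are not the paper's definitions.

```python
import math


def log_softmax(row, temperature=1.0):
    """Numerically stable log-softmax of one row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def soft_targets(sim, alpha=0.5, temperature=0.1):
    """Blend annotated one-to-one labels with similarity-based soft labels.

    sim[i][j] is the similarity of image i and text j; the annotated
    (possibly noisy) pair for image i is assumed to be text i.
    alpha is a hypothetical trust weight on the annotation.
    """
    targets = []
    for i, row in enumerate(sim):
        probs = [math.exp(lp) for lp in log_softmax(row, temperature)]
        targets.append([
            alpha * (1.0 if j == i else 0.0) + (1.0 - alpha) * p
            for j, p in enumerate(probs)
        ])
    return targets


def soft_cross_entropy(sim, targets, temperature=0.1):
    """Mean cross-entropy between soft targets and predicted distributions."""
    loss = 0.0
    for row, target in zip(sim, targets):
        log_probs = log_softmax(row, temperature)
        loss -= sum(q * lp for q, lp in zip(target, log_probs))
    return loss / len(sim)


if __name__ == "__main__":
    # Text 1 is also a plausible match for image 0, so it receives soft mass.
    sim = [[0.9, 0.8, 0.1], [0.2, 0.9, 0.1], [0.1, 0.0, 0.3]]
    for row in soft_targets(sim):
        print([round(q, 3) for q in row])
```

In practice the targets would be computed from detached (non-differentiated) similarities so that the model cannot trivially lower the loss by reshaping its own targets; that detail is framework-specific and omitted here.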

Paper Structure

This paper contains 35 sections, 3 theorems, 13 equations, 6 figures, and 5 tables.

Key Result

Theorem 1

Given a multi-modal input pair $(V,T)$ with corresponding representations $F_V$ and $F_T$, the mutual information $I(T; F_T)$ and $I(V; F_V)$ can be decomposed into two complementary terms, i.e.,

Figures (6)

  • Figure 1: Illustrative comparison of entangled methods and our DisNCL, where different colors indicate the corresponding feature spaces. The red/green elements refer to MII and MEI, respectively. The event description and conceptual definition in textual MEI refer to 'history' and 'legally deaf' in the text.
  • Figure 2: The overview of our DisNCL, where black and red arrows indicate the model forward and optimization constraints; the green blocks indicate MII, while the pink and blue ones denote MEI of $V$ and $T$, respectively.
  • Figure 3: Illustration of the hard negative strategy's one-to-one correspondence (above) and our soft many-to-many correspondence (below), where gray blocks denote the masked pairs, and green/red blocks indicate positive/negative samples.
  • Figure 4: Ablation on disentanglement analysis, where Ours$^\dag$ and Ours$^*$ refer to DisNCL without $L_{Dis}+L_{reg}$ and without $L_{reg}$, respectively. Ours$^\dag$ is further trained using $(V_S,V_X)$ and $(T_S,T_X)$ to reconstruct $(V,T)$, ensuring that the disentangled representations capture all input information.
  • Figure 5: Soft target visualization in both image-to-text (above) and text-to-image (below) retrieval on Flickr30K with 20% noise.
  • ...and 1 more figures
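For context on the one-to-one baseline that Figure 3 contrasts with soft targets, the sketch below shows a common form of the hard negative strategy: a hinge loss that, for each annotated pair, penalizes only the single most similar in-batch negative in each retrieval direction. It assumes a square similarity matrix with annotated pairs on the diagonal; the function name and the margin value are illustrative assumptions, not settings from the paper.

```python
def hard_negative_triplet_loss(sim, margin=0.2):
    """Hinge loss on the hardest in-batch negative, in both directions.

    sim[i][j] is the similarity of image i and text j; pair (i, i) is the
    annotated positive. Requires at least two pairs in the batch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest negative text for image i (image-to-text direction).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (text-to-image direction).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total / n


if __name__ == "__main__":
    sim = [[0.9, 0.8], [0.1, 0.7]]
    print(hard_negative_triplet_loss(sim))
```

Because every off-diagonal entry is treated as a strict negative, a mislabeled or genuinely many-to-many pair is pushed apart with full force, which is the behavior that soft targets are meant to relax.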

Theorems & Definitions (7)

  • Theorem 1
  • proof
  • Definition 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof