Disentangled Noisy Correspondence Learning
Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang
TL;DR
DisNCL tackles noisy cross-modal correspondences in image-text retrieval with an information-theoretic disentanglement framework that separates modality-invariant information (MII) from modality-exclusive information (MEI). By training in the MII subspace and using softened, many-to-many cross-modal targets, the method produces more robust similarity predictions and mitigates MEI noise. The approach combines variational mutual information (MI) estimators, adversarial objectives, and a final joint loss with a regularizer that enforces disentanglement. It achieves state-of-the-art performance on synthetic and real-world noisy benchmarks, and MI reduction and visualization provide evidence of effective disentanglement. These results suggest DisNCL's potential to improve real-world multi-modal retrieval systems where noisy alignments are prevalent and to inspire further theory-driven disentanglement in cross-modal learning.
Abstract
Cross-modal retrieval is crucial for understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical because real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises because MEI is not shared across modalities, so aligning it during training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, which adaptively balances the extraction of modality-invariant information (MII) and MEI with certifiably optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategies for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model the noisy many-to-many relationships inherent in multi-modal inputs, enabling noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy, with an average recall improvement of 2%. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
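To make the idea of soft matching targets concrete, the following is a minimal illustrative sketch, not the formulation used in DisNCL. It assumes a small image-text similarity matrix and blends the annotated one-to-one (one-hot) targets with a softmax over the similarities, so that unannotated but plausible pairs receive non-zero target mass. The function names and the hyperparameters `alpha` and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(sim, alpha=0.3, temperature=0.1):
    """Blend one-hot (annotated pair) targets with similarity-derived ones.

    sim[i][j] is the similarity between image i and text j. alpha and
    temperature are hypothetical hyperparameters, not values from the paper.
    """
    targets = []
    for i, row in enumerate(sim):
        probs = softmax(row, temperature)
        targets.append([
            (1.0 - alpha) * (1.0 if i == j else 0.0) + alpha * p
            for j, p in enumerate(probs)
        ])
    return targets


def soft_cross_entropy(sim, targets, temperature=0.1):
    """Mean row-wise cross-entropy between softmax(sim) and the soft targets."""
    loss = 0.0
    for row, target in zip(sim, targets):
        probs = softmax(row, temperature)
        loss -= sum(t * math.log(p) for t, p in zip(target, probs))
    return loss / len(sim)


# Toy similarities: image 0 also matches text 2 well (a many-to-many case).
sim = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.9],
]
targets = soft_targets(sim)
loss = soft_cross_entropy(sim, targets)
```

In this toy example, the first row's target assigns most of its mass to the annotated pair but a noticeably larger share to text 2 than to text 1, which is the kind of many-to-many softening the abstract describes. In a real training loop the targets would typically be computed from detached (non-differentiated) similarity predictions; that detail is likewise an assumption here.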
