A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels

Haochen Han; Minnan Luo; Huan Liu; Fang Nan

TL;DR

UOT-RCL, a Unified framework based on Optimal Transport for Robust Cross-modal Retrieval, uses a semantic alignment based on partial OT to progressively correct noisy labels, together with a novel cross-modal consistent cost function that blends different modalities and provides a precise transport cost.

Abstract

Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces multimodal samples to align with incorrect semantics, and it widens the heterogeneous gap. Both result in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide a precise transport cost. Second, to narrow the discrepancy in multi-modal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both components leverage the inherent correlation among multi-modal data to facilitate an effective cost function. Experiments on three widely used cross-modal retrieval datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
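
To make the idea of OT-based label correction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a balanced, entropic-regularized OT problem solved with Sinkhorn iterations in place of the partial OT used in the paper, and it assumes a simple cost that averages the class predictions of the two modalities. The function names `sinkhorn` and `correct_labels`, the 0.5/0.5 blend, and the uniform class marginal are all hypothetical choices made for illustration.

```python
import numpy as np


def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropic-regularized OT solved with plain Sinkhorn iterations."""
    kernel = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)       # rescale to match the row marginal
        v = col_mass / (kernel.T @ u)     # rescale to match the column marginal
    return u[:, None] * kernel * v[None, :]


def correct_labels(img_probs, txt_probs, eps=0.1):
    """Hypothetical helper: turn blended predictions into OT-based labels.

    Assumption: the cost is the negative log of the averaged image and text
    class probabilities; the paper's actual cost function is not shown here.
    """
    n_samples, n_classes = img_probs.shape
    blended = 0.5 * (img_probs + txt_probs)
    cost = -np.log(blended + 1e-8)
    row_mass = np.full(n_samples, 1.0 / n_samples)   # each sample carries equal mass
    col_mass = np.full(n_classes, 1.0 / n_classes)   # assumed uniform class prior
    plan = sinkhorn(cost, row_mass, col_mass, eps)
    return plan.argmax(axis=1), plan


# Toy data: 6 samples, 3 classes, both modalities confident in the same class.
true = np.array([0, 0, 1, 1, 2, 2])
img_probs = np.full((6, 3), 0.05)
img_probs[np.arange(6), true] = 0.9
txt_probs = img_probs.copy()

labels, plan = correct_labels(img_probs, txt_probs)
```

Each row of `plan` can be read as a soft label for one sample, and the column constraint keeps the corrected labels from collapsing onto a single class. A partial OT variant would transport only a fraction of the total mass, so that low-confidence samples can be left unassigned.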

Paper Structure (32 sections, 18 equations, 8 figures, 3 tables)

Introduction
Related Works
Cross-Modal Retrieval
Learning with Noisy Labels
Optimal Transport
Background
Problem Setup
Optimal Transport Theory
Methodology
Identify Confident Cross-modal Pairs
Progressive Label Correction with Semantic Alignment
Cross-modal Consistent Cost Function
Progressive Label Correction as an OT Problem
Bridging Heterogeneous Gap with Relation Alignment
The Unified Training Objective
...and 17 more sections

Figures (8)

Figure 1: Training with noisy labels results in poor cross-modal retrieval performance. On the one hand, noisy labels can wrongly force irrelevant samples to be similar in the shared space. On the other hand, noisy labels can confuse the discriminative connections among different modalities and thus widen the heterogeneous gap.
Figure 2: Illustration of UOT-RCL. Our method mainly contains two components: (1) Progressive Label Correction with Semantic Alignment, which aims to mitigate the influence of noisy-labeled samples. (2) Bridging Heterogeneous Gap with Relation Alignment, which aims to learn discriminative representations within the same semantics. The two components can be trained under a unified objective, facilitating robust cross-modal retrieval.
Figure 3: Cross-modal retrieval performance of the proposed method versus ELRCMR in terms of mAP scores on the validation set of the Wikipedia dataset under different noise ratios.
Figure 4: Cross-modal retrieval performance of our UOT-RCL in terms of mAP scores on the Wikipedia validation set with different values of $\gamma$. (a) The image-to-text performance under 20% noise ratio. (b) The text-to-image performance under 20% noise ratio. (c) The image-to-text performance under 80% noise ratio. (d) The text-to-image performance under 80% noise ratio. We set $\gamma$ to 0, 0.1, 0.95, 0.99, and 1 to study the influence on retrieval results.
Figure 5: Some retrieval cases on XMediaNet under a 40% noise ratio. For each image query, i.e., (a)-(c), we show the top-5 ranked texts. For each text query, i.e., (d)-(e), we show the top-3 ranked images. We mark the label of each correct retrieval result in green and each incorrect one in red.
...and 3 more figures
