Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu

TL;DR

This work addresses noisy image-text correspondence (NC) in text-to-image person re-identification (TIReID). It introduces Robust Dual Embedding (RDE), which combines Confident Consensus Division (CCD) to select clean training data with a Triplet Alignment Loss (TAL) that stabilizes learning under NC, on top of a dual embedding scheme covering both global and token-focused cross-modal interactions. TAL upper-bounds the conventional triplet ranking loss and keeps its focus on hard negatives while training more stably. RDE achieves state-of-the-art results on CUHK-PEDES, ICFG-PEDES, and RSTPReID, both under synthetic NC and on the original data. The method is robust and overfits less to noisy pairs, which matters for real-world TIReID systems trained with noisy supervision.

Abstract

Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community that aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and have achieved promising performance, they implicitly assume that the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, some image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to low image quality and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking Loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
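
As a rough illustration of the consensus idea behind CCD, the sketch below assumes that each of the two embedding modules has already flagged every training pair as clean or noisy; how those flags are produced is not covered here. The function name `consensus_division`, the flag lists, and the three-way split into agreed-clean, agreed-noisy, and disagreeing pairs are hypothetical choices for illustration, not the authors' implementation.

```python
def consensus_division(clean_flags_a, clean_flags_b):
    """Split sample indices by agreement between two clean/noisy decisions.

    clean_flags_a / clean_flags_b: per-sample booleans (True = judged clean),
    one list per embedding module. Both inputs are assumed, not from the paper.
    """
    if len(clean_flags_a) != len(clean_flags_b):
        raise ValueError("both modules must judge the same samples")

    clean, noisy, uncertain = [], [], []
    for idx, (a, b) in enumerate(zip(clean_flags_a, clean_flags_b)):
        if a and b:
            clean.append(idx)        # both modules agree: treat as clean
        elif not a and not b:
            noisy.append(idx)        # both modules agree: treat as noisy
        else:
            uncertain.append(idx)    # the modules disagree
    return clean, noisy, uncertain


# Toy decisions for four image-text pairs (made-up values).
flags_global = [True, True, False, False]
flags_token = [True, False, False, True]
clean, noisy, uncertain = consensus_division(flags_global, flags_token)
```

Only the indices in `clean` form the consensus set in this sketch; whether and how the disagreeing pairs are reused is a separate design decision.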

Paper Structure (39 sections, 2 theorems, 22 equations, 14 figures, 8 tables, 1 algorithm)

This paper contains 39 sections, 2 theorems, 22 equations, 14 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

TAL is an upper bound of TRL. Here, $\hat{T}_i\in\{ T_j|l_{ij}=0, \forall j\in \{1,\cdots,K\} \}$ is the hardest negative text for $I_i$, and $\hat{I}_i\in\{ I_j|l_{ji}=0, \forall j\in \{1,\cdots,K\} \}$ is the hardest negative image for $T_i$.
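
The bound in Lemma 1 can be checked numerically with a small sketch. It assumes one positive pair per query, a hinge of the form $[m - S^{+} + \cdot]_+$ in each retrieval direction, and a log-exponential term $\tau \log \sum_j \exp(S_j/\tau)$ over the negatives in place of the hardest negative. The function names, the toy similarity values, and the values of $m$ and $\tau$ are illustrative assumptions, not taken from the paper's code.

```python
import math


def trl_one_direction(s_pos, s_negs, m=0.1):
    """Triplet ranking loss that uses only the hardest (max-similarity) negative."""
    return max(0.0, m - s_pos + max(s_negs))


def tal_one_direction(s_pos, s_negs, m=0.1, tau=0.015):
    """Hinge loss with a log-exponential term over all negatives (assumed form)."""
    scaled = [s / tau for s in s_negs]
    top = max(scaled)
    # Numerically stable log-sum-exp; tau * lse is always >= max(s_negs).
    lse = top + math.log(sum(math.exp(x - top) for x in scaled))
    return max(0.0, m - s_pos + tau * lse)


def bidirectional(loss_fn, s_pos, i2t_negs, t2i_negs, **kwargs):
    """Sum the image-to-text and text-to-image terms for one positive pair."""
    return loss_fn(s_pos, i2t_negs, **kwargs) + loss_fn(s_pos, t2i_negs, **kwargs)


# Toy similarities for one image-text pair (made-up values).
s_pos = 0.62
i2t_negs = [0.60, 0.35, 0.58, 0.10]  # S(I_i, T_j) over negative texts
t2i_negs = [0.41, 0.55, 0.20]        # S(I_j, T_i) over negative images

trl = bidirectional(trl_one_direction, s_pos, i2t_negs, t2i_negs)
tal = bidirectional(tal_one_direction, s_pos, i2t_negs, t2i_negs)
print(f"TRL = {trl:.4f}, TAL = {tal:.4f}, TAL >= TRL: {tal >= trl}")
```

Because the log-sum-exp term never falls below the largest negative similarity, the relaxed loss cannot be smaller than the hardest-negative loss. As $\tau$ shrinks, the two values approach each other in this sketch.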

Figures (14)

  • Figure 1: Illustration of noisy correspondence. The figure shows an example of the NC problem, which occurs when image-text pairs are wrongly aligned, i.e., false positive pairs (FPPs). Since the model does not know in practice which pairs are noisy, these pairs unavoidably degrade performance by providing incorrect supervision. As seen in the figure, (a) the clean image-text pair is semantically matched, while (b) the noisy pair is not, which would cause the cross-modal model to learn erroneous visual-textual associations. Note that both examples in (a) and (b) are real pairs from the RSTPReid dataset (zhu2021dssl).
  • Figure 2: Overview of our RDE. (a) illustrates the cross-modal embedding model used in RDE, which consists of basic global embedding (BGE) and token selection embedding (TSE) modules with different granularity. By integrating them, RDE can capture coarse-grained cross-modal interactions while selecting informative local token features to encode more fine-grained representations for a more accurate similarity. (b) shows the core of RDE for robust similarity learning, which consists of Confident Consensus Division (CCD) and Triplet Alignment Loss (TAL). CCD performs consensus division to obtain confident clean training data, thus avoiding being misled by noisy pairs. Unlike the traditional Triplet Ranking Loss (TRL) (faghri2017vse++), TAL exploits an upper bound that considers all negative pairs, thus yielding more stable learning.
  • Figure 3: The difference between TRL, TRL-S, and the proposed TAL in the similarity distribution versus iterations. The $y$-$z$ plane represents the similarity density. The corresponding test Rank-1 scores are given in brackets for convenience.
  • Figure 4: Variation of performance with different $m$ and $\tau$.
  • Figure 5: Test performance (Rank-1) versus epochs on the CUHK-PEDES and ICFG-PEDES datasets with 50% noise.
  • ...and 9 more figures

Theorems & Definitions (3)

  • Lemma 1
  • Lemma 2
  • Proof 1