YYDS: Visible-Infrared Person Re-Identification with Coarse Descriptions

Yunhao Du, Zhicheng Zhao, Fei Su

TL;DR

The paper tackles visible-infrared person re-identification (VI-ReID) under a new Refer-VI-ReID setting, in which coarse text descriptions supplement infrared probes to supply the color information that infrared images lack. It proposes YYDS, a Y-Y-shape architecture that disentangles color and texture via two branches and a joint relation module, trained with text-IoU regularization and KL-based distribution matching. It also proposes CMKR, a cross-modal extension of k-reciprocal re-ranking that uses new neighbor search strategies and a local query expansion method (MA-LQE) to mitigate modality bias. The authors report substantial improvements over state-of-the-art methods on the SYSU-MM01, RegDB, and LLCM datasets and examine both components through ablations. The work supports more robust cross-modal retrieval using natural language cues, with practical implications for 24-hour surveillance and search applications, and the code is publicly available.

Abstract

Visible-infrared person re-identification (VI-ReID) is challenging due to considerable cross-modality discrepancies. Existing works mainly focus on learning modality-invariant features while suppressing modality-specific ones. However, retrieving visible images from infrared samples alone is an extreme problem because of the absence of color information. To this end, we present the Refer-VI-ReID setting, which aims to match target visible images from both infrared images and coarse language descriptions (e.g., "a man with red top and black pants") to complement the missing color information. To address this task, we design a Y-Y-shape decomposition structure, dubbed YYDS, to decompose and aggregate texture and color features of targets. Specifically, a text-IoU regularization strategy is first presented to facilitate the decomposition training, and a joint relation module is then proposed to infer the aggregation. Furthermore, a cross-modal version of the k-reciprocal re-ranking algorithm, named CMKR, is investigated, in which three neighbor search strategies and one local query expansion method are explored to alleviate the modality bias problem of the near neighbors. We conduct experiments on the SYSU-MM01, RegDB and LLCM datasets with our manually annotated descriptions. Both YYDS and CMKR achieve remarkable improvements over SOTA methods on all three datasets. Code is available at https://github.com/dyhBUPT/YYDS.
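
The abstract names a text-IoU regularization strategy but does not define it. The sketch below shows one plausible reading, assuming text-IoU is the intersection-over-union of the word sets of two coarse descriptions. The function name `text_iou`, the whitespace tokenization, and the example sentences other than the abstract's own are illustrative assumptions, not the authors' implementation.

```python
def text_iou(desc_a: str, desc_b: str) -> float:
    """Word-set IoU between two descriptions (assumed definition, for illustration)."""
    words_a = set(desc_a.lower().split())
    words_b = set(desc_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty descriptions share no evidence; treat the overlap as zero.
        return 0.0
    return len(words_a & words_b) / len(union)


# The first description is the abstract's example; the other two are hypothetical.
anchor = "a man with red top and black pants"
similar_colors = "a woman with red top and black pants"
different_colors = "a man with white top and blue shorts"

same_score = text_iou(anchor, anchor)
similar_score = text_iou(anchor, similar_colors)
different_score = text_iou(anchor, different_colors)
```

Under this reading, two people with different identities but similar clothing colors receive a high score, which is the kind of soft similarity a color branch could be regularized toward instead of a hard identity label.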

Paper Structure (20 sections, 15 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between different task settings. (a) Visible-Infrared ReID. (b) Text-Image ReID. (c) Our proposed referring visible-infrared ReID.
  • Figure 2: Overview of our Y-Y-shape decomposition structure.
  • Figure 3: Left: The framework of YYDS, which includes a visible Y-shape branch and an infrared Y-shape branch. Each branch consists of a color encoder $E_c^*$, a texture encoder $E_t^*$ and a joint relation module (JRM) $E_j^*$. The two $E_t^*$ partially share weights to eliminate modality-specific information. During training, the overall framework is optimized by two ReID losses, $L_t^{reid}$ and $L_j^{reid}$, and one KL divergence loss $L_c^{kl}$ with text-IoU regularization. Right: The details of the JRM, which is composed of a texture-centered relation block $E_{j,t}^*$, a color-centered relation block $E_{j,c}^*$ and a joint relation block $E_{j,tc}^*$. $B$ is the batch size, $C$ is the channel dimension, and $H \times W$ is the size of the feature map.
  • Figure 4: The illustration of text-IoU regularization. Only $x_1$ shares the same identity with $x_0$, but $x_2$ and $x_3$ share similar color clues with $x_0$.
  • Figure 5: The illustration of different neighbor strategies from the perspective of distance matrices. (a) The baseline method searches for neighbors based on the original distance matrix. (b) The constrained strategy limits the search domain to the gallery set. (c) The divided strategy separately normalizes the four submatrices. (d) The extended strategy integrates the baseline and constrained strategies.
  • ...and 1 more figures
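
The difference between the baseline and constrained strategies in the Figure 5 caption can be sketched on a toy distance matrix. This is a minimal illustration, assuming a joint matrix over queries and gallery items; the helper `k_nearest`, the index layout, and all distance values are hypothetical and are not taken from the paper or its code.

```python
def k_nearest(dist_row, k, allowed=None, exclude=None):
    """Return the indices of the k smallest distances in dist_row.

    allowed: optional list of indices that restricts the search domain.
    exclude: optional index to skip (typically the sample itself).
    """
    candidates = range(len(dist_row)) if allowed is None else allowed
    candidates = [j for j in candidates if j != exclude]
    return sorted(candidates, key=lambda j: dist_row[j])[:k]


# Toy joint distance matrix over [q0, q1, g0, g1, g2] (made-up values).
# Indices 0-1 are infrared queries, indices 2-4 are visible gallery images.
# Same-modality distances are deliberately small to mimic modality bias.
DIST = [
    [0.0, 0.2, 0.6, 0.7, 0.9],
    [0.2, 0.0, 0.8, 0.5, 0.9],
    [0.6, 0.8, 0.0, 0.3, 0.4],
    [0.7, 0.5, 0.3, 0.0, 0.35],
    [0.9, 0.9, 0.4, 0.35, 0.0],
]
GALLERY = [2, 3, 4]

# Baseline: search the whole matrix, so the other infrared query ranks first.
baseline = k_nearest(DIST[0], k=2, exclude=0)

# Constrained: limit the search domain to the gallery set.
constrained = k_nearest(DIST[0], k=2, allowed=GALLERY, exclude=0)
```

Here `baseline` is `[1, 2]` and `constrained` is `[2, 3]`: restricting the search domain keeps the same-modality query `q1` out of the neighbor list of `q0`. The sketch covers only the forward nearest-neighbor search, not the reciprocal check or the divided and extended strategies.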