SCMM: Calibrating Cross-modal Representations for Text-Based Person Search

Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Wei Zhou, Weiping Ding, Victor C. M. Leung

TL;DR

Text-based person search (TBPS) must bridge the semantic gap between images and natural language so that a text query can retrieve the right person from among visually similar individuals. The authors propose SCMM, a dual-encoder framework that combines a sew calibration loss with adaptive margins for global cross-modal alignment and a masked caption modeling loss, computed through a cross-modal decoder, for fine-grained word-level correspondences. SCMM reports state-of-the-art Rank-1 accuracy of 73.81% on CUHK-PEDES (with CLIP pretraining), 64.25% on ICFG-PEDES, and 57.35% on RSTPReID, supported by ablation studies. Because the decoder is used only during training and discarded at inference, the model retains rich cross-modal interaction during training while keeping inference computationally efficient.

Abstract

Text-Based Person Search (TBPS) aims to retrieve target person images from a large-scale gallery using natural language descriptions, posing fundamental challenges in cross-modal representation learning. Existing methods often struggle to bridge the semantic gap between heterogeneous modalities while capturing fine-grained correspondences essential for discriminating visually similar individuals. To address these challenges, we propose Sew Calibration and Masked Modeling (SCMM), a unified framework that calibrates cross-modal representations through complementary learning mechanisms. Notably, SCMM introduces two novel components: a sew calibration loss that dynamically aligns image-text features using quality-guided adaptive margins based on textual information density, and a masked caption modeling loss that establishes fine-grained cross-modal correspondences through transformer-based masked prediction. Additionally, the sew calibration mechanism implements bidirectional constraints to effectively compress same-identity features in the shared embedding space, while the masked modeling component leverages a cross-modal decoder to learn word-level visual-textual relationships, enabling discrimination of subtle attribute differences. Our dual-encoder architecture achieves an effective balance between representation capability and computational efficiency by employing a training-only decoder design. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReID benchmarks demonstrate that SCMM achieves state-of-the-art performance with Rank1 accuracies of 73.81%, 64.25%, and 57.35%, respectively. Comprehensive ablation studies validate the effectiveness of each proposed component.
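
The retrieval setting and the Rank-1 metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the dual encoder has already produced one embedding per caption and per gallery image, and that matching uses cosine similarity; the function names (`rank_gallery`, `rank1_accuracy`) and the toy two-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(text_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(text_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: scores[i], reverse=True)


def rank1_accuracy(text_embs, text_ids, gallery_embs, gallery_ids):
    """Fraction of text queries whose top-ranked image has the same identity."""
    hits = 0
    for emb, pid in zip(text_embs, text_ids):
        top = rank_gallery(emb, gallery_embs)[0]
        hits += int(gallery_ids[top] == pid)
    return hits / len(text_embs)


# Toy data (hypothetical): three gallery images and three text queries.
gallery_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_ids = [0, 1, 2]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
text_ids = [0, 1, 2]

ranking = rank_gallery(text_embs[0], gallery_embs)
rank1 = rank1_accuracy(text_embs, text_ids, gallery_embs, gallery_ids)
```

In this toy setup the third query is closer to the image of identity 0 than to its own identity, so only two of the three queries are retrieved correctly at Rank-1. Nothing beyond the two encoders' embeddings is needed at this stage, which is consistent with the training-only decoder design described above.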

Paper Structure (26 sections, 10 equations, 7 figures, 7 tables, 1 algorithm)

This paper contains 26 sections, 10 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the motivation behind our method. In cross-modal tasks, a compact and well-aligned image-text feature distribution in the shared embedding space is crucial for bridging the inter-modal gap. Capturing fine-grained image-text correspondences is equally vital for distinguishing between similar individuals in text-based person search.
  • Figure 2: Overview of our proposed SCMM. The framework consists of a dual encoder that extracts image-text features and calibrates cross-modal representations with the sew calibration loss, and a decoder that performs cross-modal interaction through the task-driven masked caption modeling. At the inference stage, only the classification (CLS) tokens from the dual encoder are used for similarity search.
  • Figure 3: Illustration of the sew calibration loss. The constraints differ between single-modal and cross-modal matching. $(A_{img}, A_{txt})$ denotes anchors in the image-text feature distribution, while $(P_{img}, P_{txt})$ and $(N_{img}, N_{txt})$ denote positive and negative sample pairs, respectively. The sew calibration loss pushes negative sample pairs apart and pulls positive sample pairs together, stitching cross-modal key information like a seam.
  • Figure 4: Comparison of singular values for image-text embedding features across CUHK-PEDES (a-b) and ICFG-PEDES (c-d) datasets. (a) and (c) depict the distribution of singular values, where a smaller inter-line gap indicates a closer cross-modal distribution. (b) and (d) present the logarithmic difference in singular values between the baseline and our approach for CUHK-PEDES and ICFG-PEDES, respectively.
  • Figure 5: Effect of manually fixed margin settings in our sew calibration loss on Rank-1 accuracy on CUHK-PEDES.
  • ...and 2 more figures
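
The push/pull behaviour described in the Figure 3 caption can be illustrated with a generic bidirectional margin (hinge) loss. This is only a sketch and not the paper's sew calibration formulation: the caption token count as a stand-in for "textual information density", the linear margin schedule, and every name and default value (`adaptive_margin`, `base_margin`, `max_margin`, `max_tokens`) are assumptions made for illustration.

```python
def adaptive_margin(num_tokens, base_margin=0.1, max_margin=0.3, max_tokens=64):
    """Assumed schedule: longer captions (a proxy for density) get a larger margin."""
    density = min(num_tokens, max_tokens) / max_tokens
    return base_margin + (max_margin - base_margin) * density


def hinge_pair_loss(sim_pos, sim_neg, margin):
    """Zero once the positive similarity beats the negative by at least the margin."""
    return max(0.0, margin - sim_pos + sim_neg)


def bidirectional_margin_loss(sim, num_tokens):
    """Average hinge loss over both retrieval directions.

    sim[i][j] is the similarity between image i and caption j; matched
    pairs sit on the diagonal. num_tokens[i] is the length of caption i.
    """
    n = len(sim)
    total, count = 0.0, 0
    for i in range(n):
        margin = adaptive_margin(num_tokens[i])
        for j in range(n):
            if j == i:
                continue
            # Image-to-text: image i should prefer caption i over caption j.
            total += hinge_pair_loss(sim[i][i], sim[i][j], margin)
            # Text-to-image: caption i should prefer image i over image j.
            total += hinge_pair_loss(sim[i][i], sim[j][i], margin)
            count += 2
    return total / count if count else 0.0


# Toy similarity matrices (hypothetical values).
well_separated = [[0.9, 0.2], [0.1, 0.8]]
poorly_separated = [[0.5, 0.4], [0.4, 0.5]]

loss_easy = bidirectional_margin_loss(well_separated, [64, 0])
loss_hard = bidirectional_margin_loss(poorly_separated, [64, 64])
```

When matched pairs already exceed every mismatched pair by the margin, the loss is zero; when positives and negatives are close, each term contributes a positive penalty, which is the push/pull effect the caption describes.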