SCMM: Calibrating Cross-modal Representations for Text-Based Person Search

Jing Liu; Donglai Wei; Yang Liu; Sipeng Zhang; Tong Yang; Wei Zhou; Weiping Ding; Victor C. M. Leung

TL;DR

Text-based person search (TBPS) requires bridging the semantic gap between images and natural language in order to retrieve the right person from a gallery that contains visually similar individuals. The authors propose SCMM, a dual-encoder framework that combines a sew calibration loss with adaptive margins for global cross-modal alignment and masked caption modeling, performed through a cross-modal decoder, for fine-grained word-level correspondences. This combination achieves state-of-the-art Rank-1 accuracy on CUHK-PEDES (73.81% with CLIP pretraining) and substantial gains on ICFG-PEDES and RSTPReid, supported by extensive ablations. Because the decoder is discarded at inference, the approach stays computationally efficient while retaining rich cross-modal interaction during training.

Abstract

Text-Based Person Search (TBPS) aims to retrieve target person images from a large-scale gallery using natural language descriptions, posing fundamental challenges in cross-modal representation learning. Existing methods often struggle to bridge the semantic gap between heterogeneous modalities while capturing the fine-grained correspondences essential for discriminating visually similar individuals. To address these challenges, we propose Sew Calibration and Masked Modeling (SCMM), a unified framework that calibrates cross-modal representations through complementary learning mechanisms. SCMM introduces two novel components: a sew calibration loss that dynamically aligns image-text features using quality-guided adaptive margins based on textual information density, and a masked caption modeling loss that establishes fine-grained cross-modal correspondences through transformer-based masked prediction. The sew calibration mechanism implements bidirectional constraints to compress same-identity features in the shared embedding space, while the masked modeling component leverages a cross-modal decoder to learn word-level visual-textual relationships, enabling discrimination of subtle attribute differences. Our dual-encoder architecture balances representation capability and computational efficiency by employing a training-only decoder design. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks demonstrate that SCMM achieves state-of-the-art performance with Rank-1 accuracies of 73.81%, 64.25%, and 57.35%, respectively. Comprehensive ablation studies validate the effectiveness of each proposed component.
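The Rank-1 figures quoted above are retrieval accuracies: for each text query, the gallery images are ranked by similarity, and the query counts as a hit if an image of the same identity appears at the top. The following is a minimal sketch of that metric, assuming embeddings are already available as plain lists of floats and that cosine similarity is the ranking score. The function names and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_k_accuracy(text_embs, text_ids, image_embs, image_ids, k=1):
    """Fraction of text queries with a same-identity image among the top-k results."""
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        # Score every gallery image against the query, then sort best first.
        scores = [cosine(query, img) for img in image_embs]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top_ids = [image_ids[i] for i in order[:k]]
        hits += query_id in top_ids
    return hits / len(text_embs)

# Toy gallery and queries (hypothetical 2-D embeddings, one image per identity).
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_ids = [0, 1, 2]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
text_ids = [0, 1, 2]

rank1 = rank_k_accuracy(text_embs, text_ids, image_embs, image_ids, k=1)
rank2 = rank_k_accuracy(text_embs, text_ids, image_embs, image_ids, k=2)
```

In this toy setup the third query is closest to the wrong identity, so it misses at Rank-1 but is recovered at Rank-2. This is the kind of near-miss between similar individuals that fine-grained alignment is meant to reduce.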

Paper Structure (26 sections, 10 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Text-Based Person Search
Metric Learning
Masked Language Modeling
Proposed Methods
Sew Calibration Loss with Constraints
Masked Caption Modeling Loss
Total Loss
Experiments
Datasets and Evaluation Metric
Implementation Details
Comparison with State-of-the-art Methods
Results on CUHK-PEDES.
Results on ICFG-PEDES and RSTPReid.
...and 11 more sections

Figures (7)

Figure 1: Illustration of the motivation behind our method. In cross-modal tasks, a compact and well-aligned image-text feature distribution in the shared embedding space is crucial for bridging the inter-modal gap. Additionally, capturing fine-grained image-text correspondences is equally vital to distinguish between similar individuals for text-based person searching.
Figure 2: Overview of our proposed SCMM. The framework consists of a dual-encoder that extracts image-text features and calibrates cross-modal representations with the sew calibration loss. We also include a decoder that performs cross-modal interaction through the task-driven masked caption modeling. At the inference stage, we only use the classification (CLS) tokens from the dual-encoder to perform similarity search.
Figure 3: Illustration of sew calibration loss. The constraints are different between single-modal and cross-modal matching. $(A_{img}, A_{txt})$ denotes anchors in image-text feature distribution, while $(P_{img}, P_{txt})$ and $(N_{img}, N_{txt})$ denote positive and negative sample pairs, respectively. The sew calibration loss pushes negative sample pairs and pulls positive sample pairs, stitching cross-modal key information like a seam.
Figure 4: Comparison of singular values for image-text embedding features across CUHK-PEDES (a-b) and ICFG-PEDES (c-d) datasets. (a) and (c) depict the distribution of singular values, where a smaller inter-line gap indicates a closer cross-modal distribution. (b) and (d) present the logarithmic difference in singular values between the baseline and our approach for CUHK-PEDES and ICFG-PEDES, respectively.
Figure 5: Effect of manually fixed margin parameter settings in our sew calibration loss on Rank-1 accuracy on CUHK-PEDES.
...and 2 more figures
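
The Figure 3 caption describes a loss that pulls positive image-text pairs together and pushes negative pairs apart, and the abstract adds that the margin adapts to textual information density and that the constraints are bidirectional. The sketch below illustrates that general idea with a generic triplet-style hinge loss applied in both the image-to-text and text-to-image directions. It is not the paper's equation: the `adaptive_margin` function, its `base`, `scale`, and `max_tokens` parameters, and the use of caption length as a stand-in for information density are all hypothetical choices made for illustration.

```python
def adaptive_margin(num_tokens, base=0.1, scale=0.2, max_tokens=64):
    """Hypothetical margin that grows with caption length (a crude density proxy)."""
    density = min(num_tokens, max_tokens) / max_tokens
    return base + scale * density

def hinge(pos_sim, neg_sim, margin):
    """Penalty when the positive pair does not beat the negative pair by the margin."""
    return max(0.0, margin - pos_sim + neg_sim)

def bidirectional_margin_loss(sim, image_ids, text_ids, caption_lengths):
    """Average hinge penalty over both retrieval directions.

    sim[i][j] is the similarity between image i and caption j.
    """
    terms = []
    for i in range(len(image_ids)):
        for j in range(len(text_ids)):
            if image_ids[i] != text_ids[j]:
                continue  # (i, j) is not a positive pair
            margin = adaptive_margin(caption_lengths[j])
            # Image-to-text: other-identity captions are negatives for image i.
            for n in range(len(text_ids)):
                if text_ids[n] != image_ids[i]:
                    terms.append(hinge(sim[i][j], sim[i][n], margin))
            # Text-to-image: other-identity images are negatives for caption j.
            for n in range(len(image_ids)):
                if image_ids[n] != text_ids[j]:
                    terms.append(hinge(sim[i][j], sim[n][j], margin))
    return sum(terms) / len(terms) if terms else 0.0

# Toy batch: two identities, one image and one caption each (hypothetical values).
ids = [0, 1]
lengths = [16, 64]  # caption lengths in tokens
separated = bidirectional_margin_loss([[1.0, 0.0], [0.0, 1.0]], ids, ids, lengths)
collapsed = bidirectional_margin_loss([[0.5, 0.5], [0.5, 0.5]], ids, ids, lengths)
```

When positives already beat negatives by more than the margin, the loss is zero. When all similarities are equal, every term contributes its full margin, so in this sketch longer captions are penalized more heavily for failing to separate identities.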
