SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Wei Zhou, Weiping Ding, Victor C. M. Leung
TL;DR
Text-Based Person Search (TBPS) requires bridging the semantic gap between images and natural language in order to retrieve a target person from a text description and to tell apart visually similar individuals. The authors propose SCMM, a dual-encoder framework that combines a sew calibration loss with adaptive margins for global cross-modal alignment and masked caption modeling with a cross-modal decoder for fine-grained word-level correspondences. This combination achieves state-of-the-art Rank-1 accuracy on CUHK-PEDES (e.g., 73.81% with CLIP pretraining), ICFG-PEDES (64.25%), and RSTPReID (57.35%), and ablation studies validate each component. Because the decoder is used only during training and discarded at inference, the method keeps rich cross-modal interaction while training and retains the retrieval efficiency of a dual encoder.
Abstract
Text-Based Person Search (TBPS) aims to retrieve target person images from a large-scale gallery using natural language descriptions, posing fundamental challenges in cross-modal representation learning. Existing methods often struggle to bridge the semantic gap between heterogeneous modalities while capturing the fine-grained correspondences essential for discriminating visually similar individuals. To address these challenges, we propose Sew Calibration and Masked Modeling (SCMM), a unified framework that calibrates cross-modal representations through complementary learning mechanisms. SCMM introduces two novel components: a sew calibration loss that dynamically aligns image-text features using quality-guided adaptive margins based on textual information density, and a masked caption modeling loss that establishes fine-grained cross-modal correspondences through transformer-based masked prediction. The sew calibration mechanism applies bidirectional constraints to compress same-identity features in the shared embedding space, while the masked modeling component uses a cross-modal decoder to learn word-level visual-textual relationships, enabling discrimination of subtle attribute differences. Our dual-encoder architecture balances representation capability and computational efficiency by employing a training-only decoder design. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReID benchmarks demonstrate that SCMM achieves state-of-the-art performance, with Rank-1 accuracies of 73.81%, 64.25%, and 57.35%, respectively. Comprehensive ablation studies validate the effectiveness of each proposed component.
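To make the reported metric concrete, the following is a minimal sketch of how Rank-1 accuracy is commonly computed for dual-encoder retrieval: each text query and each gallery image is embedded independently, the gallery is ranked by cosine similarity, and a query counts as a hit when its top-ranked image shares the query's identity. The function names (`cosine`, `rank1_accuracy`) and the toy embeddings are illustrative assumptions; this is not the authors' evaluation code, and the choice of cosine similarity is assumed rather than taken from the paper.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank1_accuracy(text_embs, text_ids, image_embs, image_ids):
    """Fraction of text queries whose top-ranked gallery image has the same identity.

    Hypothetical helper: the embeddings are assumed to come from independent
    text and image encoders, with no decoder involved at inference time.
    """
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        scores = [cosine(query, img) for img in image_embs]
        # Index of the gallery image with the highest similarity to the query.
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += image_ids[best] == query_id
    return hits / len(text_embs)


# Toy data (assumed, for illustration only): three gallery images, three queries.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_ids = [0, 1, 2]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
text_ids = [0, 1, 2]

acc = rank1_accuracy(text_embs, text_ids, image_embs, image_ids)
print(f"Rank-1 accuracy: {acc:.4f}")
```

In this toy example the third query is closest to the image of identity 0 rather than its own identity 2, so two of the three queries are hits. Since gallery embeddings do not depend on the query, they can be computed once and reused across all queries, which is the efficiency argument for a dual-encoder design.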
