CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval
Yating Liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin Liao
TL;DR
The paper addresses the gap between vision and language in Text-based Person Retrieval (TPR), where training data is limited, by introducing CSKT, a CLIP-based framework for parameter-efficient knowledge transfer built on Bidirectional Prompts Transferring (BPT) and Dual Adapters Transferring (DAT). With the CLIP backbone frozen and only about 12M parameters fine-tuned (roughly 7.4% of the full model), CSKT reuses CLIP's pre-trained cross-modal knowledge rather than relearning it. It reports competitive or state-of-the-art results on CUHK-PEDES, ICFG-PEDES, and RSTPReid, with improved efficiency and generalization. The results suggest that prompt and adapter mechanisms, a form of parameter-efficient transfer learning (PETL), can provide robust, resource-efficient cross-modal alignment for TPR.
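As a quick sanity check on the parameter budget, the two rounded figures quoted above (about 12M trainable parameters, roughly 7.4% of the model) imply an approximate total model size. The helper names below are hypothetical, and because both inputs are rounded, the result is only an estimate.

```python
def implied_total(trainable: float, fraction: float) -> float:
    """Total parameter count implied by a trainable count and its share of the model."""
    return trainable / fraction


def trainable_fraction(trainable: float, frozen: float) -> float:
    """Share of parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


TRAINABLE = 12e6   # "about 12M", rounded figure from the summary
FRACTION = 0.074   # "roughly 7.4%", rounded figure from the summary

total = implied_total(TRAINABLE, FRACTION)   # on the order of 160M parameters
frozen = total - TRAINABLE                   # parameters left frozen in the backbone

print(f"implied total: {total / 1e6:.0f}M, frozen: {frozen / 1e6:.0f}M")
```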
Abstract
Text-based Person Retrieval (TPR) aims to retrieve images of a target person given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially given the scarcity of large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TPR. Specifically, to exploit CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module built from text-to-image and image-to-text bidirectional prompts and coupling projections. Second, Dual Adapters Transferring (DAT) is designed to transfer knowledge on the output side of Multi-Head Attention (MHA) in both the vision and language branches. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for only 7.4% of the entire model, demonstrating its efficiency, effectiveness, and generalization.
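The two mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the adapter's internal structure, its initialization, or the prompt lengths, so the bottleneck design (down-projection, ReLU, up-projection, residual), the zero-initialized up-projection, the toy widths, and all function names below are assumptions chosen for illustration.

```python
import random


def linear(x, layer):
    """Apply y = W x + b to one vector x; layer is (W, b) with W as a list of rows."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_linear(rng, d_in, d_out, scale=0.02):
    """Small random weights and a zero bias."""
    w = [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]
    return w, [0.0] * d_out


def zero_linear(d_in, d_out):
    """All-zero layer (assumed init) so the adapter starts as an identity map."""
    return [[0.0] * d_in for _ in range(d_out)], [0.0] * d_out


def adapter(h, down, up):
    """Assumed bottleneck adapter on an MHA output token: h + up(relu(down(h)))."""
    z = [max(0.0, v) for v in linear(h, down)]
    return [hi + ui for hi, ui in zip(h, linear(z, up))]


def coupled_prompts(own, other, coupling):
    """Prompt tokens for one encoder: its own learnable prompts plus the other
    modality's prompts mapped into this encoder's width by a coupling projection."""
    return own + [linear(p, coupling) for p in other]


rng = random.Random(0)
D_TXT, D_IMG, BOTTLENECK = 8, 12, 4  # toy widths, not the paper's

# Learnable prompts for each modality (2 text tokens, 3 image tokens).
text_prompts = [[rng.gauss(0.0, 0.02) for _ in range(D_TXT)] for _ in range(2)]
image_prompts = [[rng.gauss(0.0, 0.02) for _ in range(D_IMG)] for _ in range(3)]

# Bidirectional coupling projections: text-to-image and image-to-text.
t2i = init_linear(rng, D_TXT, D_IMG)
i2t = init_linear(rng, D_IMG, D_TXT)
image_side = coupled_prompts(image_prompts, text_prompts, t2i)
text_side = coupled_prompts(text_prompts, image_prompts, i2t)

# Adapter applied to a stand-in for one token of a frozen MHA output.
mha_out = [rng.gauss(0.0, 1.0) for _ in range(D_IMG)]
down = init_linear(rng, D_IMG, BOTTLENECK)
up = zero_linear(BOTTLENECK, D_IMG)
adapted = adapter(mha_out, down, up)
```

In this sketch only the prompts, the coupling projections, and the adapter layers would be trained; the backbone that produces `mha_out` stays frozen. With the zero-initialized up-projection, the adapter initially returns its input unchanged, so training starts from the behavior of the frozen model.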
