UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval

Yating Liu, Yaowei Li, Xiangyuan Lan, Wenming Yang, Zimo Liu, Qingmin Liao

TL;DR

This work tackles Text-based Person Retrieval (TPR) by transferring CLIP knowledge through a unified parameter-efficient transfer learning (PETL) framework. UP-Person integrates three lightweight modules (Prefix, LoRA, and Adapter) with two enhancements (S-Prefix and L-Adapter) to jointly capture local and global cross-modal information while mitigating inter-module conflicts. A parameter-free Similarity Distribution Matching (SDM) loss guides alignment between image and text representations, enabling strong performance while fine-tuning only 4.7% of the parameters. Empirically, UP-Person achieves state-of-the-art results on CUHK-PEDES, ICFG-PEDES, and RSTPReid, and demonstrates strong generalization and efficiency on coarse-grained cross-domain tasks, making it well suited to edge and multi-scenario deployment.
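To make the "parameter-free" alignment objective concrete, here is a minimal NumPy sketch of a similarity-distribution-matching style loss. It assumes the commonly described form: a KL divergence between the softmax over in-batch cosine similarities and the row-normalized identity-matching distribution, summed over the image-to-text and text-to-image directions. The function name `sdm_loss`, the temperature `tau=0.02`, and the smoothing constant `eps` are illustrative assumptions, not values taken from this page.

```python
import numpy as np


def sdm_loss(img_feats, txt_feats, pids, tau=0.02, eps=1e-8):
    """Sketch of an SDM-style loss (assumed form, no trainable weights)."""
    # Cosine similarity requires L2-normalized features.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    # True matching distribution: pairs sharing a person ID count as matches.
    labels = (pids[:, None] == pids[None, :]).astype(np.float64)
    q = labels / labels.sum(axis=1, keepdims=True)

    def kl_to_labels(logits):
        # Numerically stable log-softmax over each row.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        p = np.exp(log_p)
        # KL(p || q), averaged over the batch.
        return np.mean(np.sum(p * (log_p - np.log(q + eps)), axis=1))

    sim = img @ txt.T / tau  # temperature-scaled similarity matrix
    return kl_to_labels(sim) + kl_to_labels(sim.T)


# Usage: perfectly aligned pairs versus deliberately mismatched pairs.
feats = np.eye(3)
ids = np.array([0, 1, 2])
aligned = sdm_loss(feats, feats, ids)
mismatched = sdm_loss(feats, np.roll(feats, 1, axis=0), ids)
```

The loss only reads the encoder outputs and the identity labels, so it adds nothing to the trainable parameter count; in this sketch the mismatched batch is penalized far more heavily than the aligned one.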

Abstract

Text-based Person Retrieval (TPR) is a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description. It has recently garnered considerable attention due to the progress of contrastive vision-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which has shown notable performance improvements compared to uni-modal pre-trained models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components, namely Prefix, LoRA, and Adapter: Prefix and LoRA are devised together to mine local information with task-specific prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to fit the unified architecture for TPR. First, S-Prefix is proposed to boost the attention paid to the prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
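The idea of an adapter running in parallel with layer normalization can be illustrated with a short NumPy sketch. Everything beyond "a trainable branch in parallel with a frozen layer normalization, joined by a residual sum" is an assumption made for illustration: the class name `ParallelLAdapter`, the bottleneck width, the ReLU activation, the `scale` factor, and the zero initialization of the up-projection are not taken from this page.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


class ParallelLAdapter:
    """Hypothetical bottleneck adapter running in parallel with a frozen LayerNorm."""

    def __init__(self, dim, bottleneck=8, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen LayerNorm parameters (kept from the pre-trained backbone).
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        # Trainable adapter weights; the zero-initialized up-projection
        # (an assumption) makes the block start as the plain LayerNorm.
        self.down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))
        self.scale = scale

    def __call__(self, x):
        frozen = layer_norm(x, self.gamma, self.beta)
        delta = np.maximum(x @ self.down, 0.0) @ self.up  # down -> ReLU -> up
        return frozen + self.scale * delta  # residual sum of the two branches


# Usage: a batch of 2 sequences, 5 tokens each, width 16.
tokens = np.random.default_rng(1).normal(size=(2, 5, 16))
block = ParallelLAdapter(dim=16)
adjusted = block(tokens)  # same shape as the input
```

Because the adapter branch touches only the normalization step, it stays separate from the modules inside attention, which is one plausible reading of how the design avoids conflicts among submodules.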

Paper Structure

This paper contains 25 sections, 20 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: The motivation for our proposed method. (a) PETL-based methods can transfer TPR-specific knowledge from both CLIP and the training data, whereas full-tuning relies solely on the training data as its knowledge source. Full-tuning (lower) uses the knowledge of the pre-trained CLIP only at initialization and loses almost all of the original knowledge, so it retains only the TPR knowledge from the training datasets (PKT). PETL (upper) fine-tunes a small number of parameters and keeps the CLIP backbone frozen, so it can integrate both the TPR-specific knowledge within CLIP (PKC), preserved by the retained parameters of the original CLIP, and the TPR-specific knowledge from the training data (PKT). Therefore, PETL methods can incorporate more knowledge than full-tuning if rationally designed. (b) On CUHK-PEDES, our approach reduces the number of trained parameters by 95.1% and improves R@1 by 5.98% compared to full-tuning CLIP.
  • Figure 2: Overview of the proposed UP-Person framework. Left: the overall backbone of UP-Person, which consists of a CLIP-based image encoder and text encoder, PETL modules for both encoders, and the parameter-free SDM loss as the optimization objective. Only a few parameters in the PETL modules are fine-tuned during training, while the original CLIP backbone is frozen. Right: the implementation details of one transformer block, shared by the image and text encoders. In addition to prefix tokens in the keys and values of MHA, S-Prefix introduces a factor $S_{p}$ in the attention calculation to enhance the gradient propagation of prefix tokens. L-Adapter is placed at the two normalization layers to adjust the overall distribution and avoid submodule conflicts. LoRA is inserted to update the weights of the keys and values. Overall, L-Adapter helps transfer global pedestrian features, while LoRA and S-Prefix, working together in MHA, focus on attention to promote local knowledge transfer for TPR. All blocks with dashed borders represent the fine-tuned modules. The far right shows more specific implementation details of L-Adapter and S-Prefix.
  • Figure 3: Illustration of S-Prefix. We use ${S_{p}}$ to denote the scalable factor applied to the prefix attention, which accelerates convergence. S-Prefix submodules are inserted in all transformer layers of both branches.
  • Figure 4: Illustration of L-Adapter. (a) Sequential Adapter is connected behind MLP or MHA. (b) Sequential L-Adapter is connected behind layernorm. (c) Parallel Adapter always spans layernorm and MLP or MHA which contains other PETL submodules. (d) Parallel L-Adapter is inserted into layernorm with residual connection, which is separated from other PETL submodules and transfers knowledge independently. (e) LN-tuning unfreezes layernorm, making the original features fine-tuned directly.
  • Figure 5: R@1 and parameters of different CLIP-based methods on CUHK-PEDES. The horizontal coordinate denotes the number of fine-tuned parameters. The gray numbers and the radius of the circles both represent the entire model size.
  • ...and 2 more figures
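
The captions for Figures 2 and 3 describe prefix tokens added to the keys and values of attention, with a factor $S_{p}$ applied to the prefix part of the attention calculation. The single-head NumPy sketch below shows one possible reading. Where exactly $S_{p}$ enters is an assumption here (it multiplies the prefix logits before the softmax), since the captions do not give the formula; the function name `s_prefix_attention` and all argument names are likewise hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def s_prefix_attention(q, k, v, prefix_k, prefix_v, s_p=1.0):
    """Single-head attention with learnable prefix keys and values.

    ASSUMPTION: s_p rescales the query-prefix logits before the softmax.
    With s_p = 1 this reduces to vanilla prefix attention.
    """
    d = q.shape[-1]
    prefix_logits = s_p * (q @ prefix_k.T) / np.sqrt(d)  # queries vs. prefix keys
    token_logits = (q @ k.T) / np.sqrt(d)                # queries vs. frozen keys
    weights = softmax(np.concatenate([prefix_logits, token_logits], axis=-1))
    out = weights @ np.concatenate([prefix_v, v], axis=0)
    # Share of attention each query assigns to the prefix tokens.
    prefix_mass = weights[:, : prefix_k.shape[0]].sum(axis=-1)
    return out, prefix_mass


# Usage: 4 queries, 6 ordinary tokens, 2 prefix tokens, width 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k, v = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
pk, pv = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
out, mass = s_prefix_attention(q, k, v, pk, pv, s_p=2.0)
```

In this reading only `prefix_k` and `prefix_v` would be trained (with `s_p` either fixed or learned), while `q`, `k`, and `v` come from the frozen CLIP projections, optionally adjusted by LoRA on the key and value weights.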