DM-Adapter: Domain-Aware Mixture-of-Adapters for Text-Based Person Retrieval

Yating Liu, Zimo Liu, Xiangyuan Lan, Wenming Yang, Yaowei Li, Qingmin Liao

TL;DR

This work tackles text-based person retrieval (TPR) by addressing two problems: the inefficiency of full-model fine-tuning and the lack of fine-grained domain adaptation in existing PETL methods. It introduces DM-Adapter, which unifies Sparse Mixture-of-Adapters (SMA) with a Domain-Aware Router (DR) inserted alongside the MLPs of both CLIP branches, guided by a load-balancing loss and a domain-informed gating mechanism. Training is end-to-end, with an objective that combines the Similarity Distribution Matching (SDM) loss with an auxiliary load-balancing (LB) loss. The method achieves state-of-the-art results on CUHK-PEDES, ICFG-PEDES, and RSTPReid with only $16$M trainable parameters and demonstrates strong memory efficiency. This domain-aware MOE-PETL framework enhances fine-grained person knowledge transfer while remaining computationally practical for real-world TPR tasks.

Abstract

Text-based person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to the person domain is now an emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges remain during fine-tuning: (i) Previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting. (ii) Existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MOE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, a Sparse Mixture-of-Adapters is placed in parallel to the MLP layers in both the vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features at a finer granularity. To encourage the router to exploit domain information effectively and to alleviate routing imbalance, a Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.
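
To make the idea of adapters running in parallel to a frozen MLP more concrete, here is a minimal pure-Python sketch of a generic sparse mixture-of-adapters with top-k routing and a Switch-style load-balancing penalty. It is an illustration under stated assumptions, not the authors' implementation: the class `SparseMixtureOfAdapters`, the helper `load_balancing_loss`, the bottleneck adapter shape, and the plain softmax router are all hypothetical choices. In particular, the plain router stands in for the paper's Domain-Aware Router, whose gating function and domain-aware prompts are not specified in this summary.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SparseMixtureOfAdapters:
    """Hypothetical sketch: bottleneck adapters added in parallel to a frozen MLP."""

    def __init__(self, dim, bottleneck, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.num_experts = num_experts
        self.top_k = top_k
        # Assumption: a plain linear router, not the paper's Domain-Aware Router.
        self.router = rand_matrix(num_experts, dim, rng)
        self.down = [rand_matrix(bottleneck, dim, rng) for _ in range(num_experts)]
        self.up = [rand_matrix(dim, bottleneck, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return full router probabilities and renormalized top-k gates."""
        probs = softmax(matvec(self.router, x))
        order = sorted(range(self.num_experts), key=lambda e: probs[e], reverse=True)
        top = order[: self.top_k]
        norm = sum(probs[e] for e in top)
        gates = {e: probs[e] / norm for e in top}
        return probs, gates

    def adapter(self, e, x):
        """Expert e: down-projection, ReLU, up-projection."""
        hidden = [max(0.0, h) for h in matvec(self.down[e], x)]
        return matvec(self.up[e], hidden)

    def forward(self, x, frozen_mlp):
        """Frozen MLP output plus the gated sum of the selected adapters."""
        probs, gates = self.route(x)
        out = list(frozen_mlp(x))
        for e, g in gates.items():
            delta = self.adapter(e, x)
            out = [o + g * d for o, d in zip(out, delta)]
        return out, probs, gates


def load_balancing_loss(all_probs, all_gates, num_experts, top_k):
    """Switch-style auxiliary loss (an assumption, not the paper's exact LB loss):
    num_experts * sum_e(fraction of routing slots given to e * mean router prob of e)."""
    n = len(all_probs)
    frac = [sum(1 for g in all_gates if e in g) / (n * top_k) for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in all_probs) / n for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Usage: route 16 random tokens through 4 experts, 2 active per token.
moa = SparseMixtureOfAdapters(dim=8, bottleneck=2, num_experts=4, top_k=2)
rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
results = [moa.forward(t, lambda x: x) for t in tokens]  # identity stands in for the frozen MLP
lb = load_balancing_loss([r[1] for r in results], [r[2] for r in results], 4, 2)
```

Under this normalization, perfectly uniform routing gives a load-balancing value of 1.0, and skewed routing pushes it higher, which is why such a term is commonly added to the main objective as a penalty. Only the router and adapter weights would be trained in a setup like this; the wrapped MLP stays frozen.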

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Evolution of CLIP-based paradigms for text-based person retrieval. (a) The FFT-based method unfreezes and trains the entire model. (b) The recent PETL-based method freezes CLIP and applies a single adapter to the input tokens, as shown on the left. Our mixture-of-adapters achieves fine-grained knowledge transfer with MOE, as shown on the right.
  • Figure 2: Comparison with CLIP-based methods. Our approach achieves the best trade-off between performance and parameter efficiency.
  • Figure 3: The overall framework of the proposed method. We adopt CLIP (ViT-B/16) as the backbone and design Domain-Aware Mixture-of-Adapters spanning the MLP layers. All parameters of vanilla CLIP are frozen during training; only a small number of parameters in DM-Adapter are trainable. The overall optimization objective incorporates the SDM loss and the auxiliary LB loss.
  • Figure 4: Architecture of DM-Adapter. DM-Adapter is mainly composed of SMA and DR. DR inserts novel domain-aware prompts into the input tokens and designs a domain gating mechanism to capture these prompts.
  • Figure 5: The results of experiments for hyper-parameters.
  • ...and 1 more figure