From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search

Jintao Sun, Hao Fei, Zhedong Zheng, Gangyi Ding

TL;DR

This work tackles data inefficiency in text-based person search with a Filtering-WoRA paradigm: it first filters a coreset from large synthetic datasets using cross-modal relevance (via BLIP-2), then fine-tunes a model with Weighted Low-Rank Adaptation (WoRA). WoRA reparameterizes the weight update as $W_{WoRA} = m \frac{\beta W_{0} + \alpha BA}{\left\| \beta W_{0} + \alpha BA \right\|_{2}}$, introducing learnable scalars $\alpha$ and $\beta$ to balance magnitude and direction for efficient adaptation. Empirically, data filtering reduces the training data volume and speeds up training, while WoRA reduces trainable parameters and FLOPs. The model achieves a Recall@1 of 76.38% and an mAP of 67.02% on CUHK-PEDES with a 19.82% reduction in training time, and improves over LoRA/DoRA baselines. The approach also yields competitive results on RSTPReid and ICFG-PEDES, showing the practical value of combining data-centric curation with parameter-efficient adaptation for scalable text-based person search.
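The reparameterization above can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that are not stated in this summary: the norm is taken column by column (as in DoRA), $m$ is a per-column magnitude vector, and $B$ starts at zero as in LoRA. The helper names `matmul`, `column_norms`, and `wora_weight` are hypothetical and are not taken from the authors' code.

```python
import math

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def column_norms(W):
    """Euclidean norm of every column of W."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]

def wora_weight(W0, B, A, m, alpha, beta):
    """Compute m * (beta*W0 + alpha*B@A) / ||beta*W0 + alpha*B@A||.

    Assumption: the norm is applied per column and m holds one
    magnitude per column, mirroring the DoRA convention.
    """
    BA = matmul(B, A)  # low-rank update, same shape as W0
    V = [[beta * w + alpha * d for w, d in zip(w_row, d_row)]
         for w_row, d_row in zip(W0, BA)]
    norms = column_norms(V)
    # Rescale every column to unit length, then to magnitude m[j].
    return [[m_j * v / n_j for v, m_j, n_j in zip(row, m, norms)] for row in V]

# Toy 2x2 example with a rank-1 update.
W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)            # magnitudes initialised from W0
B = [[0.0], [0.0]]              # zero-initialised, as in LoRA
A = [[1.0, -1.0]]
W_init = wora_weight(W0, B, A, m, alpha=1.0, beta=1.0)
W_flip = wora_weight(W0, B, A, m, alpha=1.0, beta=-1.0)
```

Under these assumptions, `W_init` equals `W0`, so adaptation starts from the pre-trained weights, and the columns of the result always have norms `m`. Setting `beta` to a negative value flips the direction of the pre-trained component without requiring `BA` to outweigh `W0`, which is one possible reading of the flexibility that the two scalars provide.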

Abstract

In text-based person search, data generation has become a prevailing practice because it addresses privacy concerns and avoids the arduous task of manual annotation. Although the amount of synthesized data can in theory be infinite, it remains an open question how much generated data optimally fuels subsequent model training. We observe that only a subset of the data in these constructed datasets plays a decisive role. Therefore, we introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and a WoRA (Weighted Low-Rank Adaptation) learning strategy for light fine-tuning. The filtering algorithm uses cross-modality relevance to remove the many coarsely matched synthetic pairs. As the amount of data decreases, we no longer need to fine-tune the entire model. Therefore, we propose the WoRA learning strategy to efficiently update a minimal portion of the model parameters. WoRA streamlines the learning process, enabling more efficient extraction of knowledge from fewer, yet potent, data instances. Extensive experiments validate the efficacy of pretraining, where our model achieves advanced and efficient retrieval performance on challenging real-world benchmarks. Notably, on the CUHK-PEDES dataset, we achieve a competitive mAP of 67.02% while reducing model training time by 19.82%.

Paper Structure (11 sections, 3 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Comparison between the proposed method and existing approaches, i.e., APTM [APTM], RaSa [Bai_2023], IRRA [jiang2023crossmodal] and TBPS-CLIP [cao2023empirical], in terms of Recall@1 and the number of parameters. Our method uses fewer parameters while achieving a competitive Recall@1.
  • Figure 2: An overview of our framework. (a) shows the flow chart of the entire training pipeline. We first obtain the filtered training image-text pairs. Then we augment the text into attribute texts according to keywords. We extract the corresponding features through the image encoder, text encoder and cross encoder. There are six loss objectives covering both the text-image and attribute-image matching tasks. (b) is an in-depth illustration of the WoRA methodology, applied within an image encoder. The pre-trained weights are decomposed into magnitude and direction components, and both components are updated using LoRA [hu2021lora] together with the learnable parameters $\alpha$ and $\beta$. Since the image encoder consumes most of the GPU memory and time, in practice we mainly apply WoRA to the image encoder.
  • Figure 3: An overview of our data filtering process. We first employ BLIP-2 [li2023blip2] to extract features from the input image-text pair $(I, T)$ and the distractor text $T_{C}$. Next, we compute the similarity and rank the results accordingly, ultimately generating the filtered dataset.
  • Figure 4: Visual explanation of data filtering. The left side shows high-quality images retained by our filtering strategy, with their corresponding text descriptions in red, while the right side shows low-quality image-text pairs that fall beyond the threshold, i.e., top-50, and are filtered out. We use the real-world training set as distractors and filter low-relevance synthesized image-text pairs according to similarity, since there is no overlap between the two.
  • Figure 5: Intuitive comparison of LoRA, DoRA, and our proposed WoRA. (a) A common case during optimization, i.e., negative correlation against $W_{0}$, which both LoRA and DoRA struggle with. The bias parameter $BA$ is hard to learn under weight decay and other regularization. (b) In contrast, WoRA uses two float scalars, i.e., $\alpha$ and $\beta$, which can efficiently adjust the vector and provide better flexibility.
  • ...and 3 more figures
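
The filtering step described in Figures 3 and 4 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes that image and text features have already been extracted (for example with BLIP-2) into plain vectors, that similarity is cosine similarity, and that a synthetic pair is kept when its own text ranks within the top-k against the distractor texts for its image. The function names and the `top_k` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_of_paired_text(image_feat, text_feat, distractor_feats):
    """1-based rank of the paired text among itself plus all distractors."""
    paired = cosine(image_feat, text_feat)
    # Count distractor texts that match the image better than its own text.
    better = sum(1 for d in distractor_feats if cosine(image_feat, d) > paired)
    return better + 1

def filter_pairs(pairs, distractor_feats, top_k=50):
    """Keep (image_feat, text_feat) pairs whose own text ranks within top_k."""
    return [(img, txt) for img, txt in pairs
            if rank_of_paired_text(img, txt, distractor_feats) <= top_k]

# Toy 2-D features: the first pair is well matched, the second is not.
pairs = [([1.0, 0.0], [1.0, 0.1]),
         ([1.0, 0.0], [0.0, 1.0])]
distractors = [[0.0, 1.0], [1.0, 1.0]]
kept = filter_pairs(pairs, distractors, top_k=1)
```

In this toy setting only the first pair survives with `top_k=1`, because one distractor text matches the second image better than its own description. Raising `top_k` loosens the filter and retains more synthetic pairs.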