Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Hongyu Chen, Bingliang Jiao, Wenxuan Wang, Peng Wang

TL;DR

This paper presents a task-driven dynamic textual prompt framework that guides the ReID model to embed images into a unified semantic space, together with a learnable knowledge distillation module that lets the model dynamically balance retaining existing knowledge with acquiring new knowledge.

Abstract

Lifelong person re-identification attempts to recognize people across cameras and integrate new knowledge from continuous data streams. Key challenges involve addressing catastrophic forgetting caused by parameter updating and domain shift, and maintaining performance in seen and unseen domains. Many previous works rely on data memories to retain prior samples. However, the amount of retained data increases linearly with the number of training domains, leading to continually increasing memory consumption. Additionally, these methods may suffer significant performance degradation when data preservation is prohibited due to privacy concerns. To address these limitations, we propose using textual descriptions as guidance to encourage the ReID model to learn cross-domain invariant features without retaining samples. The key insight is that natural language can describe pedestrian instances in a domain-invariant style, suggesting a shared textual space for all pedestrian images. By leveraging this shared textual space as an anchor, we can prompt the ReID model to embed images from various domains into a unified semantic space, thereby alleviating catastrophic forgetting caused by domain shifts. To achieve this, we introduce a task-driven dynamic textual prompt framework. The framework features a dynamic prompt fusion module, which adaptively constructs and fuses two different textual prompts as anchors, guiding the ReID model to embed images into a unified semantic space. Additionally, we design a text-visual feature alignment module to learn a more precise mapping between fine-grained visual and textual features. We also develop a learnable knowledge distillation module that allows our model to dynamically balance retaining existing knowledge with acquiring new knowledge. Extensive experiments demonstrate that our method outperforms state-of-the-art methods under various settings.
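
To make the stability/plasticity trade-off concrete, the following is a minimal sketch of standard temperature-scaled knowledge distillation, not the paper's exact loss. It assumes, following the Figure 2 caption, that the quantity being adjusted is the distillation temperature. In a real training framework that temperature would be a trainable parameter; here it is a plain argument. The function names and the logit values are hypothetical.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax; a larger temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(old_logits, new_logits, temperature):
    """KL(old || new) on softened outputs, scaled by T^2 as in standard distillation."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_old, p_new))
    return temperature ** 2 * kl

# Hypothetical outputs of the frozen previous-stage model and the current model.
old_logits = [4.0, 1.0, 0.5]
new_logits = [2.5, 2.0, 0.5]

# Sweep the temperature to see how strongly the old model constrains the new one.
for t in (0.5, 1.0, 4.0):
    print(t, round(distillation_loss(old_logits, new_logits, t), 4))
```

The loss is zero when the two models agree and grows as the current model drifts from the previous one, so the temperature controls how much weight that drift receives.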

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: The inspiration for our rehearsal-free method. Natural language descriptions of pedestrian images from various domains act as domain-independent anchors. These consistent descriptions guide the mapping of images into a unified semantic space, effectively mitigating the issue of catastrophic forgetting.
  • Figure 2: The proposed DTP framework comprises two main modules: Dynamic Prompt Fusion (DPF) and Text-Visual Feature Alignment (TFA). First, the DPF module dynamically fuses Invariant Prompts (IP) and Person Knowledge Prompts (PKP) into anchors in semantic space, establishing the mapping between image features and textual prompts. Next, the TFA module aligns fine-grained local image features with the corresponding textual prompts for a more refined mapping. In addition, the learnable knowledge distillation (LKD) module adjusts the temperature coefficient in the knowledge distillation process, dynamically balancing the plasticity and stability of the model.
  • Figure 3: Illustration of the Text-Visual Feature Alignment (TFA) module. The TFA module divides the image features into four blocks according to height and slices the text features according to the different body texts, achieving finer-grained alignment by matching local text features with local image features.
  • Figure 4: The effectiveness of different hyperparameters.
  • Figure 5: t-SNE visualization of feature distributions for Order-1; colours denote different domains. (a) Image encoder versus our method on seen datasets: the domain gaps between feature distributions are eliminated after applying our method. (b) Image encoder versus our method on unseen datasets: the results show that our method can even eliminate the domain gaps among unseen domains.
  • ...and 1 more figure
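
The height-wise splitting described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the feature map has already been pooled over width into one vector per row, that each stripe is mean-pooled, and that alignment is scored with cosine similarity. The body-part names, the vectors, and all function names are hypothetical.

```python
import math

def split_by_height(feature_map, num_parts=4):
    """Split a list of row vectors (top to bottom) into num_parts contiguous stripes."""
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts in this sketch")
    step = height // num_parts
    return [feature_map[i * step:(i + 1) * step] for i in range(num_parts)]

def mean_pool(stripe):
    """Average the row vectors of one stripe into a single feature vector."""
    dim = len(stripe[0])
    return [sum(row[d] for row in stripe) / len(stripe) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 8-row feature map with 3-dimensional features per row.
feature_map = [[float(r + 1), 1.0, 0.0] for r in range(8)]

# Hypothetical text embeddings, one per body-part phrase (names are illustrative only).
part_text = {
    "head": [1.0, 1.0, 0.0],
    "upper body": [3.0, 1.0, 0.0],
    "lower body": [5.0, 1.0, 0.0],
    "feet": [7.0, 1.0, 0.0],
}

# One pooled visual vector per stripe, paired with the text vector for the same part.
stripes = [mean_pool(s) for s in split_by_height(feature_map)]
for (name, text_vec), visual_vec in zip(part_text.items(), stripes):
    print(name, round(cosine(visual_vec, text_vec), 4))
```

Pairing each stripe with the text for the same body region is what makes the alignment local: a mismatch in one region is penalized on its own and is not averaged away in a single global feature.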