Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou

Abstract

Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or from a visual classification-pretrained model, even though Vision-Language Models (VLMs) have shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to a VLM, they consider only global-aware learning, so fine-grained attribute knowledge is underleveraged, which limits both knowledge acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby using historical knowledge to reinforce the learning of new knowledge and to alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. An Inter-domain Cross-modal Attribute Reinforcement scheme is then developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that VLADR outperforms state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity, respectively. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR

Paper Structure (17 sections, 12 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: (a) Existing methods rely on global feature learning when exploiting a VLM, and suffer from misguided attention on the background and insufficient utilization of human semantics. (b) Our method explicitly models and transfers human attribute knowledge across domains, achieving more precise human semantic acquisition and continual attribute knowledge reinforcement.
  • Figure 2: Overview of our framework. Given the new data $D_t$, two stages operate sequentially. (a) The Multi-grain Text Attribute Disentanglement mechanism performs local attribute extraction and global attribute modeling based on pretrained VLMs. (b) The Inter-domain Cross-modal Attribute Reinforcement scheme then uses these disentangled multi-grain attributes, applying visual-text alignment $\mathcal{L}_{MAlign}$ to guide visual attribute extraction. In addition, inter-domain attribute alignment $\mathcal{L}_{DAlign}$ is performed to accumulate the continually learned knowledge.
  • Figure 3: Trend of knowledge accumulation on seen domains.
  • Figure 4: Trend of generalization to unseen domains.
  • Figure 5: Attention map comparison across different approaches.
  • ...and 1 more figure
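
The two alignment terms named in the Figure 2 caption, $\mathcal{L}_{MAlign}$ and $\mathcal{L}_{DAlign}$, are not defined in this excerpt. The sketch below is therefore only an illustration of what such terms could look like, not the paper's formulation. It assumes a symmetric InfoNCE-style contrastive loss for the cross-modal term and a mean cosine distance between old-model and new-model attribute features for the inter-domain term. The function names, the temperature value, and the `(K, D)` feature layout are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_modal_alignment(visual, text, temperature=0.07):
    """Hypothetical stand-in for L_MAlign (assumption, not the paper's loss).

    visual, text: (K, D) arrays; row k of each describes the same attribute.
    Returns a symmetric InfoNCE loss that pulls matching rows together.
    """
    v, t = l2_normalize(visual), l2_normalize(text)
    logits = v @ t.T / temperature              # (K, K) scaled cosine similarities
    idx = np.arange(len(v))                     # matching pairs sit on the diagonal
    v2t = -log_softmax(logits)[idx, idx].mean()     # visual -> text direction
    t2v = -log_softmax(logits.T)[idx, idx].mean()   # text -> visual direction
    return float(0.5 * (v2t + t2v))


def inter_domain_alignment(new_feats, old_feats):
    """Hypothetical stand-in for L_DAlign (assumption, not the paper's loss).

    Mean cosine distance between attribute features from the current model
    and from a frozen copy of the model trained on earlier domains.
    """
    n, o = l2_normalize(new_feats), l2_normalize(old_feats)
    return float(np.mean(1.0 - np.sum(n * o, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_attrs = rng.normal(size=(5, 16))       # 5 attributes, 16-dim embeddings
    visual_attrs = text_attrs + 0.1 * rng.normal(size=(5, 16))
    print("cross-modal:", cross_modal_alignment(visual_attrs, text_attrs))
    print("inter-domain:", inter_domain_alignment(visual_attrs, text_attrs))
```

Under these assumptions, both terms shrink as paired attribute features agree more closely. How the actual method weights or combines them is not stated here.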