CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification

Zhenyu Cui, Jiahuan Zhou, Yuxin Peng

TL;DR

The paper tackles Visible-Infrared Lifelong Person Re-Identification (VI-LReID), where models must sequentially learn from both visible and infrared data without forgetting previously learned cross-modal knowledge. It introduces CKDA, a Cross-modality Knowledge Disentanglement and Alignment framework that explicitly separates modality-common and modality-specific knowledge using Modality-Common Prompting (MCP) and Modality-Specific Prompting (MSP), followed by Cross-modality Knowledge Aligning (CKA) with dual prototype-based spaces. The optimization combines a base loss with a prompting consistency term and inter-/intra-modality alignment losses: $\mathcal{L} = \mathcal{L}_{base} + \alpha \mathcal{L}_{p} + \beta(\mu \mathcal{L}_{inter} + (1-\mu) \mathcal{L}_{intra})$. Experiments on four VI-LReID benchmarks show CKDA achieving state-of-the-art performance with notable anti-forgetting effects, illustrating the practical value of explicit knowledge disentanglement and balanced cross-modal alignment for day-night pedestrian re-identification.
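The weighting in the objective above can be illustrated with a minimal sketch. It only shows how the four terms combine; the function name, the scalar inputs, and the check that the balance weight lies in [0, 1] are assumptions made for illustration and are not taken from the authors' released code.

```python
def ckda_total_loss(l_base, l_p, l_inter, l_intra, alpha, beta, mu):
    """Combine the four loss terms as in the TL;DR formula.

    l_base, l_p, l_inter, l_intra: already-computed loss values
    (plain floats here; hypothetical stand-ins for real tensors).
    alpha, beta: weights of the prompting and alignment terms.
    mu: balance between inter- and intra-modality alignment.
    """
    # Assumption: mu is a convex-combination weight, so restrict it to [0, 1].
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    # Balanced alignment term: mu * L_inter + (1 - mu) * L_intra.
    alignment = mu * l_inter + (1.0 - mu) * l_intra
    # Total: L_base + alpha * L_p + beta * alignment.
    return l_base + alpha * l_p + beta * alignment


if __name__ == "__main__":
    # Made-up values, chosen only to show the arithmetic.
    total = ckda_total_loss(1.0, 0.5, 0.2, 0.4, alpha=1.0, beta=2.0, mu=0.5)
    print(round(total, 6))
```

With the balance weight at 1 only the inter-modality alignment term contributes, and at 0 only the intra-modality term does, so a single weight trades the two alignment spaces against each other.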

Abstract

Lifelong person Re-IDentification (LReID) aims to match the same person using individual data continuously collected from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference between acquiring modality-specific knowledge and preserving modality-common knowledge, where conflicting knowledge leads to collaborative forgetting. To address these problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify the discriminative information that is shared across modalities and the information that is specific to each modality, avoiding mutual interference between the two kinds of knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old knowledge in two mutually independent inter- and intra-modality feature spaces, based on dual-modality prototypes, in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.

Paper Structure

This paper contains 18 sections, 15 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of different LReID methods when facing visible and infrared images. Existing methods suffer from the conflict between new knowledge acquisition (e.g., acquiring radiation knowledge specific to the infrared modality) and old knowledge preservation (e.g., preserving shape knowledge that coexists in both modalities, which conflicts with the former). In contrast, our CKDA achieves knowledge balancing by explicitly aligning the disentangled modality-common and modality-specific knowledge.
  • Figure 2: Overview of our proposed Cross-modality Knowledge Disentanglement and Alignment (CKDA) method. The input images are first fed into the Modality-Common Prompting (MCP) module and the Modality-Specific Prompting (MSP) module to generate the corresponding prompted image tokens. Then, the Cross-modality Knowledge Aligning (CKA) module exploits the cross-modality knowledge prototype to align the above two kinds of knowledge, respectively.
  • Figure 3: The influence of hyperparameters in our CKDA.
  • Figure 4: Visualization of the generated common prompt $\bm{k}_{com}$, specific prompt $\bm{k}_{spe}$, visible image prompt $\bm{k}^{v}$, and infrared image prompt $\bm{k}^{v}$.
  • Figure 5: The t-SNE visualization of different training stages, where different colours represent different identities and $\bigcirc, \triangle$ represent features of infrared and visible modalities, respectively.