Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

Jiachen Li, Xiaojin Gong, Dongping Zhang

Abstract

Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, their performance still leaves room for improvement. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID yields only limited generalization improvement, because the VLM produces only global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.

Paper Structure

This paper contains 34 sections, 11 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of the single- and multi-grained alignments. (a) indicates the single-grained alignment. Similar IDs from diverse visual domains can be summarized by a sentence in a global view. Domain-specific attributes such as illumination, view angle, image resolution and so on are eliminated due to the nature of language. However, the single-grained description ignores ID nuances. (b) indicates our multi-grained alignment, where extra fine-grained descriptions of parts are employed to represent more subtle ID differences. Note that exact words in descriptions are used as examples to explain our idea. In implementation, several learnable prompts are adopted for more flexible descriptions instead of manually designed words.
  • Figure 2: Pipeline of our framework. Before training, the VGE generates part locations for later use. In stage-1, the multi-grained learnable prompts of each ID are aligned with the corresponding images. The part locations are directly used to extract local visual features. In stage-2, the proposed AM-MSA module is adopted in each transformer layer to extract local features and is optimized together with the entire image encoder. In this procedure, the part locations act as pseudo labels to supervise the AM-MSA module. Prototypical memories are utilized to train both stages. In inference, only the image encoder is utilized.
  • Figure 3: Illustration of the AM-MSA module. (a) indicates its inner architecture. The red path highlights a bypass branch which adaptively generates an attention mask $\mathbf{A}_l$ for local information aggregation in a transformer. For simplicity, the residual connections in the transformer are omitted. (b) indicates the architecture of the RMP module, where a cross-attention layer, an MLP layer, several layer normalization layers and a sigmoid activation function are used to predict the foreground mask of each part.
  • Figure 4: Visualization of part locating with the VGE on the Market1501 dataset. Different parts are detected in green boxes.
  • Figure 5: Illustration of the imperfect results. In (a), the distribution of the height of bounding boxes for leg regions from the Market1501 dataset is presented. Blue denotes the raw results and red denotes the calibrated results. Without calibration, we find irregular peaks at values approximating the entire image height, which correspond to incorrect locations containing the whole person body. Extremely small values are also incorrect and often represent over-splitting. (b) is a sample of an oversized partition. (c) and (d) are examples of over-split leg regions. Green boxes are predictions of the VGE.
  • ...and 3 more figures
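
The idea behind the attention mask $\mathbf{A}_l$ in the Figure 3 caption, restricting a part token so that it aggregates information only from patches belonging to that part, can be illustrated with a minimal sketch. This is not the authors' AM-MSA implementation: it assumes a single head, a single query token, and a hard binary mask supplied by the caller, whereas the paper predicts the mask adaptively through the RMP module (possibly as a soft mask). The function name `masked_attention` is hypothetical.

```python
import math


def masked_attention(query, keys, values, mask):
    """Softmax attention for one query, restricted to positions where mask is True.

    query:  list of d floats (e.g. a part token; an assumption for illustration)
    keys:   list of n key vectors, each of length d (e.g. patch tokens)
    values: list of n value vectors of equal length
    mask:   list of n booleans; False positions receive exactly zero weight
    Returns (aggregated_vector, attention_weights).
    """
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]

    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one position")

    # Numerically stable softmax over the kept positions only.
    max_score = max(kept)
    weights = [math.exp(s - max_score) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights
```

With a mask such as `[True, False, True]`, the second patch contributes nothing to the output regardless of its score, which is the local aggregation behaviour the caption describes. A learned, differentiable mask would replace the hard booleans in a trainable version.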