Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang

TL;DR

This paper formulates PAR as a vision-language fusion problem that fully exploits the relations between pedestrian images and attribute labels, and proposes a region-aware prompt tuning technique that adjusts very few parameters while keeping both the pre-trained VL model and the multi-modal Transformer fixed.

Abstract

Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which may yield sub-optimal results because the relations between pedestrian images and attribute labels are insufficiently exploited. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit these relations. Specifically, the attribute phrases are first expanded into sentences, and the pre-trained vision-language model CLIP is then adopted as our backbone for embedding both the images and the attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the two sets of features effectively, and a feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique, which adjusts very few parameters (i.e., only the prompt vectors and classification heads) and keeps both the pre-trained VL model and the multi-modal Transformer fixed. Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance in both standard and zero-shot settings for PAR, including the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
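
The first step described in the abstract, expanding attribute phrases into sentences before text encoding, can be sketched as follows. This is a minimal illustration only: the template wording, the label-normalization rules, and the function names `normalize_phrase` and `expand_attributes` are assumptions and are not taken from the paper or its released code.

```python
import re

# Hypothetical template: the source does not state the sentence wording used.
TEMPLATE = "A pedestrian with the attribute: {phrase}."


def normalize_phrase(label: str) -> str:
    """Turn a raw label such as 'LongHair' or 'short_sleeve' into plain words."""
    # Insert a space at each lowercase-to-uppercase boundary (CamelCase).
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    # Replace underscores and hyphens with spaces.
    spaced = re.sub(r"[_\-]+", " ", spaced)
    # Lowercase and collapse repeated whitespace.
    return " ".join(spaced.lower().split())


def expand_attributes(labels, template=TEMPLATE):
    """Expand each attribute label into a full sentence for a text encoder."""
    return [template.format(phrase=normalize_phrase(label)) for label in labels]


if __name__ == "__main__":
    for sentence in expand_attributes(["LongHair", "short_sleeve", "Backpack"]):
        print(sentence)
```

Each resulting sentence would then be passed to the text encoder, giving one text embedding per attribute to pair with the image features.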

Paper Structure (17 sections, 9 equations, 10 figures, 9 tables)

This paper contains 17 sections, 9 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison of CNN-based, RNN-based, Transformer-based, and our newly proposed CLIP guided vision-language fusion frameworks for pedestrian attribute recognition.
  • Figure 2: An illustration of the CLIP model (Radford et al., 2021). It takes a batch of image-text pairs as input and encodes their features using a ResNet or Transformer network. The contrastive learning loss is built on paired and unpaired vision-language samples. CLIP shows strong zero-shot transfer ability on many downstream tasks.
  • Figure 3: An illustration of our proposed PromptPAR framework, which takes the pedestrian image and a pre-defined attribute set as input and models the PAR task as a vision-language fusion problem. Its main modules are the CLIP visual encoder, the CLIP textual encoder, the multi-modal Transformer (MM-Former), and the classification head. The CLIP encoders provide a better feature representation, and the MM-Former outputs a unified feature representation for attribute classification. More importantly, we adopt prompt tuning to optimize very few network parameters: only the prompt vectors and the classification head are tunable. Extensive experiments demonstrate the efficiency and effectiveness of our proposed PAR framework.
  • Figure 4: Average precision of 20 pedestrian attributes on the RAPv1 dataset.
  • Figure 5: Experimental results for different prompt lengths and frequencies on the RAPv1 dataset.
  • ...and 5 more figures
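
The Figure 3 caption notes that only the prompt vectors and the classification head are tunable. The bookkeeping behind such a claim, the share of parameters that are actually updated, can be sketched as below. The module names follow the caption, but every parameter count is a made-up placeholder chosen for illustration. None of these numbers come from the paper, and the sketch does not attempt to reproduce the reported 0.75% figure.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool  # False means the weights stay frozen during tuning


def trainable_fraction(groups):
    """Return the share of all parameters that are updated during training."""
    total = sum(g.num_params for g in groups)
    tuned = sum(g.num_params for g in groups if g.trainable)
    return tuned / total


# Hypothetical sizes, for illustration only (not taken from the paper).
groups = [
    ParamGroup("clip_visual_encoder", 80_000_000, trainable=False),
    ParamGroup("clip_text_encoder", 60_000_000, trainable=False),
    ParamGroup("mm_former", 9_000_000, trainable=False),
    ParamGroup("prompt_vectors", 600_000, trainable=True),
    ParamGroup("classification_head", 400_000, trainable=True),
]

if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(groups):.2%}")
```

In a real training setup the same ratio would be computed from the framework's own parameter objects after the frozen modules have been excluded from the optimizer.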