Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition

Xiao Wang, Qian Zhu, Jiandong Jin, Jun Zhu, Futian Wang, Bo Jiang, Yaowei Wang, Yonghong Tian

TL;DR

The paper tackles video-based pedestrian attribute recognition by reframing it as a vision-language fusion problem and leveraging a fixed CLIP backbone augmented with lightweight spatiotemporal side networks. The proposed VTFPAR++ employs prompt-enhanced attribute descriptions, a multi-modal Transformer for fusion, and a weighted cross-entropy loss to handle imbalance, achieving state-of-the-art results on MARS-Attribute and DukeMTMC-VID-Attribute with substantially fewer tunable parameters. The work demonstrates that parameter-efficient, spatiotemporal tuning can robustly exploit video information for fine-grained attribute recognition, offering practical benefits in memory and computation. Overall, the approach advances video PAR by combining strong cross-modal representations with efficient adaptation, enabling more reliable performance under occlusion and motion blur in real-world scenarios.
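The weighted cross-entropy idea mentioned above can be sketched roughly as follows. This summary does not give the exact weighting, so the exponential weights below (based on each attribute's positive-sample ratio, a scheme often used in PAR work) are an assumption, and the function name `weighted_bce` is hypothetical.

```python
import math


def weighted_bce(logits, labels, pos_ratio):
    """Weighted binary cross-entropy over the attributes of one sample.

    logits:    raw scores, one per attribute
    labels:    0/1 ground-truth values, one per attribute
    pos_ratio: fraction of positive training samples per attribute
    """
    total = 0.0
    for z, y, r in zip(logits, labels, pos_ratio):
        # Assumed weighting: rare positives (small r) and rare negatives
        # (large r) both receive a larger weight.
        w = math.exp(1.0 - r) if y == 1 else math.exp(r)
        # Numerically stable BCE on logits: softplus(z) - y * z
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += w * (softplus - y * z)
    return total / len(logits)
```

With this weighting, a missed positive on a rare attribute costs more than the same mistake on a common one, which is one way to counter label imbalance.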

Abstract

Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images; however, their performance is unreliable in challenging scenarios such as heavy occlusion and motion blur. In this work, we propose to understand human attributes from video frames, exploiting temporal information by efficiently fine-tuning a pre-trained multi-modal foundation model. Specifically, we formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained foundation model CLIP to extract the visual features. More importantly, we propose a novel spatiotemporal side-tuning strategy to achieve parameter-efficient optimization of the pre-trained vision foundation model. To better utilize the semantic information, we take the full list of attributes to be recognized as another input and transform the attribute words/phrases into corresponding sentences via split, expand, and prompt operations. The text encoder of CLIP is then used to embed the processed attribute descriptions. The averaged visual tokens and text tokens are concatenated and fed into a fusion Transformer for multi-modal interactive learning. The enhanced tokens are fed into a classification head for pedestrian attribute prediction. Extensive experiments on two large-scale video-based PAR datasets validate the effectiveness of our proposed framework. The source code of this paper is available at https://github.com/Event-AHU/OpenPAR.
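As a rough illustration of the split, expand, and prompt operations, the sketch below turns a raw attribute label into a sentence that could be passed to a text encoder. The abbreviation table, the prompt template, and all function names are hypothetical; the abstract does not specify them.

```python
import re

# Hypothetical abbreviation table; a real dataset would need its own.
EXPANSIONS = {"ub": "upper body", "lb": "lower body"}

# Hypothetical prompt template, not taken from the paper.
PROMPT = "a photo of a pedestrian with attribute: {}."


def split_label(label):
    """Split camelCase, underscores, hyphens and spaces into lowercase words."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return [w for w in re.split(r"[\s_\-]+", spaced.lower()) if w]


def expand_words(words):
    """Replace known abbreviations with their full form."""
    return [EXPANSIONS.get(w, w) for w in words]


def describe(label):
    """Split, expand, then wrap the attribute in the prompt template."""
    return PROMPT.format(" ".join(expand_words(split_label(label))))


# One sentence per attribute label.
sentences = [describe(a) for a in ["topColorBlack", "ub_black"]]
```

Each resulting sentence would then be embedded by the CLIP text encoder to obtain one text token per attribute.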

Paper Structure (18 sections, 9 equations, 9 figures, 7 tables)

This paper contains 18 sections, 9 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: (a, b). Comparison between the RGB frame-based and video-based pedestrian attribute recognition; (c-f). Comparison of memory consumption, F1 score, time cost, and tunable parameters between existing PEFT (parameter-efficient fine-tuning) strategies and our newly proposed spatiotemporal side network tuning method.
  • Figure 2: An illustration of our proposed video-based pedestrian attribute recognition framework, termed VTFPAR++. It formulates video-based attribute recognition as a video-language fusion problem, taking the pedestrian video and the attribute set as input. The pre-trained multi-modal foundation model CLIP is adopted as the basic feature extraction network. We further propose a lightweight spatiotemporal side network to aggregate the features from different Transformer layers and video frames. These features are fused into a unified representation via global average pooling operators. We process the given attributes into language descriptions via split, expand, and prompt engineering, and extract their features using the CLIP text encoder. Then, we align the vision-language features using a fusion Transformer and classify the attributes via the attribute prediction head. Our framework requires less GPU memory and fewer tunable parameters, and is more efficient to train and deploy, yet still achieves leading attribute recognition accuracy on two public datasets.
  • Figure 3: An illustration of our proposed Spatial Side Network (SSN). SSN models the spatial relationships among multi-level CLIP visual features for each frame separately; the per-frame results are then aggregated via GAP and interact with the text features.
  • Figure 4: An illustration of our proposed Temporal Side Network (TSN). TSN primarily models temporal relationships among CLIP visual features from the same layer across multiple frames, to mitigate the effects of challenges such as occlusion and blurring.
  • Figure 5: Comparison between our method and VTF at different frames on the MARS dataset.
  • ...and 4 more figures
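
The pooling and concatenation step described in the Figure 2 caption might look like the following shape-level sketch. It assumes the averaging runs over the frame axis (so the number of visual tokens is preserved) and uses plain nested lists in place of tensors; both function names are hypothetical, and the fusion Transformer itself is not modeled.

```python
def average_over_frames(video_tokens):
    """Average token features across frames.

    video_tokens[t][n] is a D-dimensional list for frame t, token n.
    Returns N tokens, each of dimension D.
    """
    num_frames = len(video_tokens)
    num_tokens = len(video_tokens[0])
    dim = len(video_tokens[0][0])
    return [
        [sum(video_tokens[t][n][d] for t in range(num_frames)) / num_frames
         for d in range(dim)]
        for n in range(num_tokens)
    ]


def build_fusion_input(video_tokens, text_tokens):
    """Concatenate averaged visual tokens and text tokens along the token axis."""
    return average_over_frames(video_tokens) + text_tokens


# Two frames, two visual tokens each, feature dimension 2.
video = [[[1.0, 2.0], [3.0, 4.0]],
         [[3.0, 4.0], [5.0, 6.0]]]
text = [[0.0, 1.0]]  # one attribute token
fusion_input = build_fusion_input(video, text)
```

The resulting token sequence is what a fusion Transformer would consume before the attribute prediction head.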