Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
Jiandong Jin, Xiao Wang, Qian Zhu, Haiyang Wang, Chenglong Li
TL;DR
Pedestrian Attribute Recognition (PAR) faces two problems: performance on existing datasets is close to saturation, and models generalize poorly across domains. The authors introduce MSP60K, a large-scale cross-domain PAR benchmark with 60,122 images and 57 attributes across eight scenarios, plus synthetic degradations that mimic challenging real-world conditions. They evaluate 17 PAR models on it under both random and cross-domain splits. They also propose LLM-PAR, a two-branch framework: a ViT-based visual branch with a Multi-Embedding Query Transformer performs attribute classification, and a Large Language Model branch adds language-based reasoning. The outputs of the two branches are fused, with mean pooling as one of several strategies considered. The authors report state-of-the-art performance for LLM-PAR on MSP60K and on public PAR benchmarks, which supports combining structured visual features with LLM-driven reasoning for robust pedestrian attribute recognition. The authors state that the MSP60K dataset and code will be released publicly to support future research and practical deployment.
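To make the mean-pooling fusion concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `sigmoid` and `mean_fuse` are hypothetical, the 0.5 threshold is an assumed default, and averaging per-attribute probabilities (rather than raw logits) is an assumption, since the summary does not say which quantity is pooled.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_fuse(visual_logits, llm_logits, threshold=0.5):
    """Fuse two branches by averaging their per-attribute probabilities.

    visual_logits / llm_logits: one raw score per attribute from each branch
    (hypothetical inputs; the real model's output format may differ).
    Returns (fused probabilities, binary attribute predictions).
    """
    if len(visual_logits) != len(llm_logits):
        raise ValueError("both branches must score the same attributes")
    # Assumption: pooling is done on probabilities, not on raw logits.
    probs = [
        (sigmoid(v) + sigmoid(l)) / 2.0
        for v, l in zip(visual_logits, llm_logits)
    ]
    # Each attribute is an independent binary decision (multi-label setting).
    preds = [p >= threshold for p in probs]
    return probs, preds


if __name__ == "__main__":
    probs, preds = mean_fuse([3.0, -3.0], [1.0, -1.0])
    print(probs, preds)
```

Averaging treats both branches as equally reliable. A weighted mean or a learned fusion layer would be alternatives when one branch is known to be stronger.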
Abstract
Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect domain differences (e.g., environments, times, populations, and data sources) and use only simple random splits, and performance on these datasets is already approaching saturation. In the past five years, no large-scale dataset has been released publicly. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset, termed MSP60K, to fill the data gap. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also applied to further narrow the gap between the dataset and challenging real-world scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models on our dataset under both random and cross-domain split protocols. Additionally, we propose a Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework first extracts features from pedestrian images with a Vision Transformer (ViT) backbone, then uses a multi-embedding query Transformer to learn partial-aware features for attribute classification. Notably, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments on multiple PAR benchmark datasets validate the effectiveness of the proposed framework. The dataset and source code accompanying this paper will be made publicly available at https://github.com/Event-AHU/OpenPAR.
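The difference between the two split protocols can be sketched in a few lines of Python. This is a hedged illustration, not the dataset's official tooling: the sample records, the `"scenario"` field, the scenario names, the 80/20 ratio, and both function names are hypothetical, and holding out entire scenarios for testing is an assumed reading of "cross-domain split".

```python
import random


def random_split(samples, train_frac=0.8, seed=0):
    """Shuffle all samples together, so every scenario appears in both sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def cross_domain_split(samples, train_scenarios):
    """Train on some scenarios and test on the rest (assumed protocol).

    No test image comes from a scenario seen during training, so the
    test score reflects generalization to unseen domains.
    """
    train = [s for s in samples if s["scenario"] in train_scenarios]
    test = [s for s in samples if s["scenario"] not in train_scenarios]
    return train, test


if __name__ == "__main__":
    # Toy records; the scenario names are placeholders, not the real ones.
    samples = [{"id": i, "scenario": f"scene_{i % 4}"} for i in range(20)]

    rand_train, rand_test = random_split(samples)
    cd_train, cd_test = cross_domain_split(samples, {"scene_0", "scene_1"})

    print(len(rand_train), len(rand_test))
    print(sorted({s["scenario"] for s in cd_test}))
```

Under the random split a model can exploit scenario-specific cues it saw during training, which is one reason scores on such splits tend to saturate. The cross-domain split removes that shortcut.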
