Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework

Jiandong Jin; Xiao Wang; Qian Zhu; Haiyang Wang; Chenglong Li

TL;DR

Pedestrian Attribute Recognition (PAR) suffers from saturated benchmarks and limited cross-domain generalization. The authors introduce MSP60K, a large-scale cross-domain PAR benchmark with 60,122 images and 57 attribute annotations across eight scenarios, plus synthetic degradations that mimic challenging real-world conditions, and they evaluate 17 PAR models under both random and cross-domain splits. They also propose LLM-PAR, a two-branch framework that pairs a ViT-based visual classifier using a Multi-Embedding Query Transformer with a Large Language Model augmentation branch; the branch outputs are fused via mean pooling (among other strategies) so that language-based reasoning supplements attribute inference. Across MSP60K and public PAR benchmarks, LLM-PAR is reported to achieve state-of-the-art performance, which the authors take as evidence for combining structured visual features with LLM-driven reasoning in robust pedestrian attribute recognition. The MSP60K dataset and code are to be released publicly to support future research and practical deployment.
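As a rough illustration of the mean-pooling fusion mentioned in the TL;DR, the sketch below averages per-attribute probabilities from two branches and thresholds the result. The names `fuse_mean` and `predict`, the choice to average sigmoid probabilities (rather than logits or features), the 0.5 threshold, and the toy numbers are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_mean(branch_logits: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool per-attribute probabilities across branches.

    branch_logits[b][a] is the logit for attribute `a` from branch `b`
    (hypothetical layout, chosen only for this sketch).
    """
    if not branch_logits:
        raise ValueError("need at least one branch")
    num_attrs = len(branch_logits[0])
    if any(len(row) != num_attrs for row in branch_logits):
        raise ValueError("all branches must score the same attributes")
    fused = []
    for a in range(num_attrs):
        probs = [sigmoid(row[a]) for row in branch_logits]
        fused.append(sum(probs) / len(probs))
    return fused


def predict(fused: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn fused probabilities into binary attribute decisions."""
    return [1 if p >= threshold else 0 for p in fused]


# Toy logits for three attributes from a visual branch and an LLM branch.
visual_branch = [2.0, -1.0, 0.0]
llm_branch = [0.0, -3.0, 2.0]
fused = fuse_mean([visual_branch, llm_branch])
labels = predict(fused)
print(labels)
```

Averaging probabilities lets a confident branch pull an uncertain one across the threshold, which is the basic appeal of ensembling the two paths.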

Abstract

Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources) and use only simple random splits, and performance on these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also applied to further narrow the gap between the dataset and challenging real-world scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets validate the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at https://github.com/Event-AHU/OpenPAR.
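The difference between the random and cross-domain split protocols named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function names `random_split` and `cross_domain_split`, the 0.75 train ratio, and the choice of held-out scenes are hypothetical and do not reproduce the actual MSP60K partitions.

```python
import random
from typing import Dict, List, Set, Tuple

Sample = Dict[str, str]  # hypothetical record: {"image": ..., "scene": ...}


def random_split(
    samples: List[Sample], train_ratio: float = 0.75, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle all samples together, so every scene can appear on both sides."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def cross_domain_split(
    samples: List[Sample], test_scenes: Set[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Hold out whole scenes, so test images come from unseen domains."""
    train = [s for s in samples if s["scene"] not in test_scenes]
    test = [s for s in samples if s["scene"] in test_scenes]
    return train, test


# Toy data: 20 samples spread evenly over four scene names.
scenes = ["Market", "School", "Kitchens", "Ski Resort"]
samples = [
    {"image": f"img_{i:03d}.jpg", "scene": scenes[i % len(scenes)]}
    for i in range(20)
]

rand_train, rand_test = random_split(samples)
cd_train, cd_test = cross_domain_split(samples, {"Kitchens", "Ski Resort"})

train_scenes = {s["scene"] for s in cd_train}
held_out_scenes = {s["scene"] for s in cd_test}
print(sorted(train_scenes), sorted(held_out_scenes))
```

Under the cross-domain protocol the two scene sets are disjoint by construction, which is what makes it a harder test of generalization than a random split.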

Paper Structure (24 sections, 6 equations, 8 figures, 9 tables)

Introduction
Related Works
Pedestrian Attribute Recognition
Benchmark Datasets for PAR
Vision-Language Models
MSP60K Benchmark Dataset
Protocols
Attribute Groups and Details
Statistical Analysis
Benchmark Baselines
Methodology
Overview
Multi-Label Classification Branch
Large Language Model Branch
Model Aggregation for PAR
...and 9 more sections

Figures (8)

Figure 1: (a, b) Comparison between existing PAR datasets and our newly proposed MSP60K dataset. (c) The synthetic degradations applied in our dataset to simulate complex and dynamic real-world environments.
Figure 2: (a) Attributes Distribution: Bar graph showing the prevalence of individual attributes across the dataset; (b) Co-occurrence Matrix of Attributes: Logarithmic heatmap showing the co-occurrence frequency of attribute pairs; (c) Attributes Distribution in Different Scenes: Circular chart illustrating attribute distribution across eight different scenes.
Figure 3: An illustration of representative samples in our newly proposed MSP60K PAR dataset.
Figure 4: t-SNE visualization of scene samples in the MSP60K PAR dataset. Each colored cluster represents samples from a different scene: "Market," "Ski Resort," "Outdoor1," "School," "Outdoor3," "Outdoor2," "Construction Site," and "Kitchens." For each scene, a pie chart is overlaid to illustrate the attribute distribution within that cluster. The legend on the right provides a detailed list of all attributes.
Figure 5: An illustration of our proposed LLM-PAR framework, showing how we use Multimodal Large Language Models (MLLMs) for deep semantic reasoning, combining images and descriptive text to provide more interpretable visual understanding. Through this framework, we can recognize pedestrian attributes and generate natural language descriptions, thereby offering more intuitive explanations. Our framework consists of three parts: visual feature extraction, language description generation, and language-enhanced classification.
...and 3 more figures
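
The logarithmic co-occurrence heatmap described in the Figure 2 caption can be approximated with a small sketch like the one below. It counts how often each attribute pair is active in the same sample and then log-scales the counts. The function names, the toy label matrix, and the use of log(1 + count) to keep zero counts finite are assumptions for illustration; the caption does not specify the exact scaling.

```python
import math
from typing import List


def cooccurrence(labels: List[List[int]]) -> List[List[int]]:
    """Count, for every attribute pair (i, j), the samples where both are 1.

    labels[n][a] is 1 if sample `n` has attribute `a`, else 0.
    The diagonal holds each attribute's own frequency.
    """
    num_attrs = len(labels[0])
    counts = [[0] * num_attrs for _ in range(num_attrs)]
    for row in labels:
        active = [i for i, value in enumerate(row) if value]
        for i in active:
            for j in active:
                counts[i][j] += 1
    return counts


def log_scale(counts: List[List[int]]) -> List[List[float]]:
    """Apply log(1 + c) so rare and frequent pairs fit on one color scale."""
    return [[math.log1p(c) for c in row] for row in counts]


# Toy annotations: four samples, three binary attributes.
labels = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
counts = cooccurrence(labels)
heat = log_scale(counts)
for row in counts:
    print(row)
```

The resulting `heat` matrix is symmetric and could be passed to any plotting library to draw the heatmap.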
