When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu
TL;DR
The paper tackles the challenge of robust pedestrian detection across diverse sensor modalities by introducing a generalist, multimodal detector (MMPedestron) and a large-scale benchmark (MMPD) including a new RGB–Event dataset (EventPed). It presents a unified transformer-based encoder with modality-aware fusion via two learnable tokens (MAA and MAF) and a modular unifier that enables effective detection across arbitrary modality combinations. Through two-stage training and modality dropout, the approach achieves state-of-the-art results on multiple benchmarks (e.g., COCO-Persons, LLVIP) and strong cross-dataset transfer, while maintaining a compact parameter footprint relative to modality-specific models. The work provides a scalable, generalizable framework for multimodal pedestrian perception with practical implications for autonomous driving, surveillance, and robotics.
Abstract
Recent years have witnessed increasing research attention to pedestrian detection that exploits the advantages of different sensor modalities (e.g., RGB, IR, Depth, LiDAR, and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multimodal perception. Unlike previous specialist models that process only one modality or one specific pair of modalities, MMPedestron can process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, i.e., MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored to specific sensor modalities. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves performance comparable to the InternImage-H model on CrowdHuman with 30x fewer parameters. Code and data are available at https://github.com/BubblyYi/MMPedestron.
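
The TL;DR mentions modality dropout as part of training, and the abstract stresses handling dynamic combinations of modal inputs. As an illustration only, the general idea of modality dropout can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function name `modality_dropout`, the per-modality drop probability `p_drop`, and the rule of always keeping at least one modality are assumptions made here for clarity.

```python
import random


def modality_dropout(sample, p_drop=0.3, rng=random):
    """Randomly drop modalities from one training sample.

    sample: dict mapping a modality name (e.g. "rgb", "ir") to its input.
    p_drop: probability of dropping each modality independently
            (hypothetical value, not taken from the paper).
    rng:    any object exposing random() and choice(), e.g. random.Random(seed).
    """
    names = list(sample)
    if not names:
        return {}
    # Keep each modality independently with probability 1 - p_drop.
    kept = [name for name in names if rng.random() >= p_drop]
    # Assumption: never hand the detector an empty input, so fall back
    # to a single randomly chosen modality if everything was dropped.
    if not kept:
        kept = [rng.choice(names)]
    return {name: sample[name] for name in kept}
```

Applied per sample during training, a scheme like this exposes the model to single-modality inputs as well as mixed ones, which is one plausible way to encourage robustness to whichever sensors are available at test time.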
