
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

TL;DR

The paper tackles the challenge of robust pedestrian detection across diverse sensor modalities by introducing a generalist, multimodal detector (MMPedestron) and a large-scale benchmark (MMPD) including a new RGB–Event dataset (EventPed). It presents a unified transformer-based encoder with modality-aware fusion via two learnable tokens (MAA and MAF) and a modular unifier that enables effective detection across arbitrary modality combinations. Through two-stage training and modality dropout, the approach achieves state-of-the-art results on multiple benchmarks (e.g., COCO-Persons, LLVIP) and strong cross-dataset transfer, while maintaining a compact parameter footprint relative to modality-specific models. The work provides a scalable, generalizable framework for multimodal pedestrian perception with practical implications for autonomous driving, surveillance, and robotics.
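Modality dropout, as mentioned above, can be illustrated with a minimal sketch. It assumes each training sample is a dictionary keyed by modality name and that every modality is dropped independently with a fixed probability; the function name `modality_dropout`, the `drop_prob` value, and the fallback rule are illustrative assumptions, not details taken from the paper.

```python
import random


def modality_dropout(sample, drop_prob=0.3, rng=None):
    """Randomly drop modalities from a multi-modal sample.

    sample:    dict mapping a modality name (e.g. "rgb", "ir") to its input.
    drop_prob: hypothetical per-modality drop probability (an assumption,
               not a value reported in the paper).
    rng:       optional random.Random instance for reproducibility.

    At least one modality is always kept so the detector still has input.
    """
    if rng is None:
        rng = random.Random()
    names = list(sample)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Every modality was dropped: fall back to one chosen at random.
        kept = [rng.choice(names)]
    return {name: sample[name] for name in kept}


if __name__ == "__main__":
    pair = {"rgb": "rgb_image", "ir": "ir_image"}
    print(sorted(modality_dropout(pair, drop_prob=0.5, rng=random.Random(0))))
```

Training on such randomly reduced inputs is one plausible way for a single model to see both unimodal and multimodal samples, which matches the stated goal of handling arbitrary modality combinations.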

Abstract

Recent years have witnessed increasing research attention towards pedestrian detection that takes advantage of different sensor modalities (e.g. RGB, IR, Depth, LiDAR and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multimodal perception. Unlike previous specialist models that only process one or a pair of specific modality inputs, MMPedestron is able to process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, i.e. MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for a specific sensor modality. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves comparable performance to the InternImage-H model on CrowdHuman with 30x fewer parameters. Code and data are available at https://github.com/BubblyYi/MMPedestron.

Paper Structure

This paper contains 42 sections, 2 equations, 10 figures, and 10 tables.

Figures (10)

  • Figure 1: MMPedestron unifies diverse modality inputs, including RGB, IR, Event, Depth and LiDAR, for pedestrian detection.
  • Figure 2: Performance on diverse datasets and modalities. MMPedestron outperforms leading models trained on the specific dataset and modality.
  • Figure 3: Overview of our proposed MMPD benchmark. (a) It encompasses a wide range of modalities, such as RGB, IR, Depth, LiDAR, and Event. (b) It includes diverse scenarios, including person-centric vs. crowd, outdoor vs. indoor, and day vs. night scenes.
  • Figure 4: (a) MMPedestron consists of a unified multi-modal encoder and a detection head. Each stage of the encoder contains a modality-specific patch embedding layer, several transformer blocks, and a modality unifier. The resulting unified tokens from multiple stages are fed into the detection head to produce detection results. (b) The modality unifier fuses multi-modal vision tokens with the guidance of MAF and incorporates the domain knowledge of MAA into the output unified tokens. For clarity, we show the case of two modalities.
  • Figure 5: Visualization of MAF (a, b) and MAA (c, d) tokens. (a,c) are for unimodal inputs, and (b,d) are for multi-modal inputs.
  • ...and 5 more figures
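
The modality unifier described in the Figure 4 caption can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the paper's formulation: it assumes each modality's MAF token is mapped to a scalar score by a hypothetical weight vector (`score_weights`), that the scores are softmax-normalized into per-modality fusion weights, and that the averaged MAA tokens are simply added to every fused token. The function names `softmax` and `unify` are likewise illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def unify(vision_tokens, maf_tokens, maa_tokens, score_weights):
    """Fuse per-modality vision tokens into one set of unified tokens.

    vision_tokens: dict modality -> list of N token vectors (each length D).
    maf_tokens:    dict modality -> MAF token vector (length D).
    maa_tokens:    dict modality -> MAA token vector (length D).
    score_weights: hypothetical length-D vector that turns a MAF token
                   into a scalar fusion score (an assumption of this sketch).
    """
    names = list(vision_tokens)
    dim = len(score_weights)
    # One scalar score per modality, derived from its MAF token.
    scores = [
        sum(w * x for w, x in zip(score_weights, maf_tokens[name]))
        for name in names
    ]
    weights = softmax(scores)
    # Average the MAA tokens of the modalities that are present.
    maa_mean = [
        sum(maa_tokens[name][d] for name in names) / len(names)
        for d in range(dim)
    ]
    num_tokens = len(vision_tokens[names[0]])
    unified = []
    for i in range(num_tokens):
        # Weighted sum across modalities, then add the MAA information.
        fused = [
            sum(wt * vision_tokens[name][i][d] for wt, name in zip(weights, names))
            + maa_mean[d]
            for d in range(dim)
        ]
        unified.append(fused)
    return unified


if __name__ == "__main__":
    tokens = {"rgb": [[1.0, 1.0]], "ir": [[3.0, 3.0]]}
    maf = {"rgb": [0.0, 0.0], "ir": [0.0, 0.0]}
    maa = {"rgb": [0.0, 0.0], "ir": [0.0, 0.0]}
    print(unify(tokens, maf, maa, score_weights=[1.0, 1.0]))
```

Because the weights are normalized over whichever modalities are present, the same function accepts one, two, or more modalities, which is the property the caption's "unified tokens" rely on.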