A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments

Malaz Tamim, Andrea Matic-Flierl, Karsten Roscher

TL;DR

The paper evaluates 3D person detection across camera-only, LiDAR-only, and camera-LiDAR fusion models on the JRDB indoor-outdoor dataset, focusing on robustness to occlusion, distance, and synthetic sensor corruptions. Using BEVDepth, PointPillars, and DAL as baselines, it shows that fusion consistently outperforms single modalities, especially in challenging scenarios, though fusion remains sensitive to misalignment and certain LiDAR corruptions. The study systematically analyzes sensor-level, misalignment, and weather-like corruptions, demonstrating that LiDAR-driven localization largely preserves performance under camera distortions, while camera-only approaches suffer drastic drops under noise and occlusion. The results underscore the value of sensor fusion for reliable 3D person detection in non-automotive domains and highlight concrete vulnerability areas to guide future robustness enhancements and cross-domain evaluations.

Abstract

Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model shows the lowest performance and is most affected by occlusion, distance, and noise. Our findings highlight the importance of sensor fusion for 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.

Paper Structure

This paper contains 28 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Visualization of typical camera and LiDAR corruptions. The top row shows one of the five camera views of the multi-view setup in JRDB; the bottom row shows the corresponding 360° LiDAR point cloud with ground-truth boxes in red. Modalities are annotated as C (camera) and L (LiDAR).
  • Figure 2: Comparison of AP$_{0.3}$ (top) and AP$_{0.5}$ (bottom) by distance categories: near, mid, and far for BEVDepth, PointPillars, and DAL.
  • Figure 3: Comparison of AP$_{0.3}$ (top) and AP$_{0.5}$ (bottom) across unoccluded, partially occluded, and heavily occluded categories for BEVDepth, PointPillars, and DAL.
  • Figure 4: Comparison of AP$_{0.3}$ (top) and AP$_{0.5}$ (bottom) across combined distance and occlusion categories for BEVDepth, PointPillars, and DAL.
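
The figures above report AP$_{0.3}$ and AP$_{0.5}$, which presumably denote average precision at 3D IoU thresholds of 0.3 and 0.5. As a rough illustration of how such a number is obtained, the sketch below computes the area under the precision-recall curve from detections that have already been matched to ground truth at a chosen threshold. The function name `average_precision`, the input layout, and the all-point interpolation are assumptions made for this example; the paper's evaluation code and the official JRDB toolkit may differ in matching rules and interpolation.

```python
from typing import List, Tuple


def average_precision(scored_matches: List[Tuple[float, bool]], num_gt: int) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    scored_matches: one (confidence, is_true_positive) pair per detection.
        Matching is assumed to be done beforehand: a detection is a true
        positive if its IoU with a not-yet-matched ground-truth box meets
        the chosen threshold (e.g. 0.3 or 0.5).
    num_gt: number of ground-truth boxes in the evaluated subset
        (e.g. one distance or occlusion category).
    """
    if num_gt == 0:
        return 0.0

    # Rank detections from most to least confident.
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)

    tp = fp = 0
    recalls: List[float] = []
    precisions: List[float] = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Replace each precision with the maximum precision at any higher recall,
    # so the curve is non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between consecutive recall values.
    ap = 0.0
    prev_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


if __name__ == "__main__":
    # Hypothetical toy input: three detections, two ground-truth persons.
    toy = [(0.9, True), (0.8, False), (0.7, True)]
    print(round(average_precision(toy, num_gt=2), 4))
```

Running the same function once with matches computed at IoU 0.3 and once at IoU 0.5, on each distance or occlusion subset, would yield the kind of per-category comparison the figures show.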