Robust Dynamic Object Detection in Cluttered Indoor Scenes via Learned Spatiotemporal Cues

Juan Rached, Yixuan Jia, Kota Kondo, Jonathan P. How

Abstract

Reliable dynamic object detection in cluttered environments remains a critical challenge for autonomous navigation. Purely geometric LiDAR pipelines that rely on clustering and heuristic filtering can miss dynamic obstacles when they move in close proximity to static structure or are only partially observed. Vision-augmented approaches can provide additional semantic cues, but are often limited by closed-set detectors and camera field-of-view constraints, reducing robustness to novel obstacles and out-of-frustum events. In this work, we present a LiDAR-only framework that fuses temporal occupancy-grid-based motion segmentation with a learned bird's-eye-view (BEV) dynamic prior. A fusion module prioritizes 3D detections when available, while using the learned dynamic grid to recover detections that would otherwise be lost due to proximity-induced false negatives. Experiments with motion-capture ground truth show our method achieves 28.67% higher recall and 18.50% higher F1 score than the state-of-the-art in substantially cluttered environments while maintaining comparable precision and position error.
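The fusion rule described in the abstract (prefer 3D detections, fall back to the learned dynamic grid where a 3D detection is missing) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` type, the `fuse_detections` function, the 2D point representation, and the `match_radius` value are all hypothetical names and parameters chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in the BEV plane, metres


@dataclass
class Detection:
    x: float
    y: float
    source: str  # "ogm" = 3D cluster centroid, "grid" = learned BEV prior


def fuse_detections(
    cluster_centroids: List[Point],
    grid_detections: List[Point],
    match_radius: float = 0.5,  # assumed association radius, not from the paper
) -> List[Detection]:
    """Keep every 3D cluster detection; add grid detections that no cluster explains."""
    fused = [Detection(x, y, "ogm") for x, y in cluster_centroids]
    for gx, gy in grid_detections:
        # A grid detection is redundant if a 3D detection already lies nearby.
        covered = any(
            math.hypot(gx - cx, gy - cy) <= match_radius
            for cx, cy in cluster_centroids
        )
        if not covered:
            fused.append(Detection(gx, gy, "grid"))
    return fused


if __name__ == "__main__":
    clusters = [(1.0, 1.0)]              # obstacle seen by the geometric pipeline
    grid = [(1.2, 1.1), (4.0, 0.0)]      # second entry was missed by clustering
    for det in fuse_detections(clusters, grid):
        print(det)
```

In this sketch the grid detection at (4.0, 0.0) is the kind of case the abstract calls a proximity-induced false negative: the geometric pipeline produced no cluster there, so the learned prior supplies the detection.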

Paper Structure (19 sections, 10 equations, 5 figures, 3 tables, 1 algorithm)

This paper contains 19 sections, 10 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: BEV of LV-DOT [xu2025lv] (top row), Dynablox [schmid2023dynablox] (middle row), and STORM (bottom row) across three experiments with increasing levels of clutter. LV-DOT struggles with proximity-induced false negatives in all experiments. Dynablox misses detections due to partial occlusions in columns 1 and 3 and due to proximity-induced false negatives in column 2. STORM detects all obstacles across all experiments.
  • Figure 2: System architecture of the proposed detection framework. Point cloud and ego-state information are processed in parallel by our occupancy grid map (OGM) and learning-based moving object segmentation (MOS) pipelines. The cluster centroids of the dynamic point cloud produced by the OGM module are fused with GridNet's 2D dynamic grids to generate dynamic obstacle detections.
  • Figure 3: Precision, recall, and F1 score in an experiment where a pedestrian traverses a room with 21 static obstacles. The dense obstacle configuration forces frequent strong occlusions and proximity-induced false negatives. STORM recovers information lost during those failure modes, achieving higher recall and F1 score than other methods.
  • Figure 4: Precision, recall, and F1 score in an experiment where a quadcopter flies at speeds ranging from 0 to 5 m/s in an empty room. STORM captures the high-speed motion of the vehicle, achieving significantly higher recall and F1 score than other methods with comparable precision.
  • Figure 5: Experimental setup with 31 static obstacles. Quadcopter is pictured at the center of the room, surrounded by tables and foam pillars. The dense obstacle arrangement generates multiple partial occlusions and narrow corridors that clustering-based systems struggle with.
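
Figures 3 and 4 report precision, recall, and F1 score against motion-capture ground truth. The sketch below shows one common way to compute these metrics for a single frame of point detections. The matching rule (greedy nearest-neighbour within `max_dist`) and the 0.5 m threshold are assumptions made for illustration; the paper's actual matching criterion is not stated in this excerpt.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def detection_metrics(
    detections: List[Point],
    ground_truth: List[Point],
    max_dist: float = 0.5,  # assumed match threshold, not from the paper
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one frame of detections."""
    unmatched_gt = list(ground_truth)
    true_pos = 0
    for dx, dy in detections:
        # Greedily match each detection to the closest unmatched ground-truth object.
        best_idx, best_dist = None, max_dist
        for i, (gx, gy) in enumerate(unmatched_gt):
            dist = math.hypot(dx - gx, dy - gy)
            if dist <= best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            unmatched_gt.pop(best_idx)
            true_pos += 1

    false_pos = len(detections) - true_pos   # detections with no ground-truth match
    false_neg = len(unmatched_gt)            # ground-truth objects that were missed
    precision = true_pos / (true_pos + false_pos) if detections else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    dets = [(0.0, 0.0), (5.0, 5.0)]
    gt = [(0.1, 0.0), (2.0, 2.0)]
    print(detection_metrics(dets, gt))
```

A missed obstacle raises the false-negative count and lowers recall without affecting precision, which is why recovering proximity-induced false negatives shows up mainly in the recall and F1 curves.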