Instantaneous Perception of Moving Objects in 3D

Di Liu, Bingbing Zhuang, Dimitris N. Metaxas, Manmohan Chandraker

TL;DR

This work tackles instantaneous perception of subtle 3D motion around autonomous vehicles using sparse LiDAR by introducing S'More, a framework that densifies object surfaces through local occupancy completion over a short window of five frames to mitigate the swimming artifact. It jointly performs occupancy completion, moving-object detection, and per-voxel motion flow estimation, supervised by warping nearby frames to the current frame and constrained to visible surfaces; the model runs at about $27$ fps for multiple objects. A dedicated subtle-motion benchmark derived from Waymo, with ground-truth motion computed from 3D box annotations and a minimal motion threshold $f_{min} < 0.2$ m, demonstrates substantial improvements in EPE, angular error, and F1 over ICP-based and scene-flow baselines. Ablation studies confirm the critical role of occupancy completion, and analyses show strong performance for small motions while maintaining competitive large-motion results. The results highlight the practical importance of instantaneous subtle-motion perception for driving safety, though the approach currently focuses on vehicles and faces challenges under extreme sparsity or occlusion.
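The two flow metrics named above, EPE and angular error, can be sketched as follows. This is a minimal illustration under common scene-flow definitions (mean Euclidean distance between flow vectors, and mean angle between their directions); the exact definitions used in the benchmark are not stated in this summary, and the function names are hypothetical.

```python
import math

def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def mean_angular_error_deg(pred, gt, eps=1e-9):
    """Mean angle in degrees between predicted and ground-truth flow directions."""
    angles = []
    for p, g in zip(pred, gt):
        norm_p = math.hypot(*p)
        norm_g = math.hypot(*g)
        if norm_p < eps or norm_g < eps:
            continue  # direction is undefined for a (near-)zero vector
        cos = sum(a * b for a, b in zip(p, g)) / (norm_p * norm_g)
        # Clamp to [-1, 1] to guard against rounding before acos.
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    return sum(angles) / len(angles) if angles else 0.0

# Illustrative values only: two objects with 0.1 m ground-truth motion along x.
gt_flow = [(0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
pred_flow = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]  # second prediction is off by 90 degrees

epe = end_point_error(pred_flow, gt_flow)
ang = mean_angular_error_deg(pred_flow, gt_flow)
```

For subtle motions the two metrics are complementary: a prediction can have a small end-point error in absolute terms simply because the motion itself is small, while its direction is still badly wrong.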

Abstract

The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important, as they indicate nuances in driving behavior that may be safety critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse LiDAR point clouds, static objects might appear to be moving, the so-called swimming effect. This intertwines with the true object motion, thereby creating ambiguity in estimation, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motions.
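The swimming effect described above can be reproduced with a toy example. The sketch below is an illustration only, not the paper's method: it uses nearest-neighbor matching as a stand-in for correspondence-based estimators, and the point coordinates and the helper name `nearest_neighbor_flow` are made up for the demonstration.

```python
import math

def nearest_neighbor_flow(src, dst):
    """For each source point, the displacement to its nearest point in dst."""
    flows = []
    for p in src:
        q = min(dst, key=lambda c: math.dist(p, c))
        flows.append(tuple(b - a for a, b in zip(p, q)))
    return flows

# A static wall along the x-axis. The surface does not move, but the sensor
# samples it at different positions in two consecutive sweeps.
frame_a = [(float(i), 0.0, 0.0) for i in range(5)]
frame_b = [(i + 0.4, 0.0, 0.0) for i in range(5)]

flows = nearest_neighbor_flow(frame_a, frame_b)

# Mean magnitude of the apparent motion; the true motion is exactly zero.
apparent_motion = sum(math.hypot(*f) for f in flows) / len(flows)
```

Here every point appears to shift by roughly 0.4 along the wall even though nothing moved. An apparent displacement of that size is comparable to a genuinely subtle motion, which is why sparse sampling makes small motions hard to tell apart from static objects.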

Paper Structure (14 sections, 4 equations, 9 figures, 5 tables)


Figures (9)

  • Figure 1: Illustration of Instantaneous Motion Perception. We visualize the motion of three objects from the Waymo dataset [sun2020scalability], each with three consecutive frames. Objects in fast and subtle motion are marked in red and blue, respectively, with arrow length indicating motion magnitude. While standard motion detection handles general large motions such as (a), we focus on instantaneous perception of subtle motions that may indicate changes in driving behavior, for example (b) a parked car starts to move, and (c) a forward-moving car stops and reverses. The visualized subtle motions in (b) and (c) are outputs of our framework. We also provide a video visualization in the supplementary material.
  • Figure 2: Overview of S'More. Given a sequential LiDAR point cloud, we first identify objects of interest by filtering out background and objects with large motion. We then voxelize the point cloud for each object, extract features through an encoder-decoder network, and perform occupancy completion. The output is passed to a motion detector and a motion flow predictor for final detection and estimation.
  • Figure 3: Integration of S'More with 3D tracking systems.
  • Figure 4: Illustration of the swimming effect on (a) the ground and (b) a static object, and (c) our ground-truth occupancy completion for (b). Blue and red points indicate points from two frames.
  • Figure 5: Qualitative Comparison. We exhibit point cloud registration results for two point cloud sets: the first frame (in red) and the last frame (in green). The results are shown using (b) ground truth motion, and estimated motions by (c) S'More (ours), (d) FastNSF, and (e) ICP. The blue points indicate resultant positions after adding flow to the red points, which should ideally align with the green points.
  • ...and 4 more figures