Instantaneous Perception of Moving Objects in 3D
Di Liu, Bingbing Zhuang, Dimitris N. Metaxas, Manmohan Chandraker
TL;DR
This work tackles the instantaneous perception of subtle 3D motion around autonomous vehicles from sparse LiDAR. It introduces S'More, a framework that densifies object surfaces through local occupancy completion over a short window of five frames to mitigate the swimming artifact. The model jointly performs occupancy completion, moving-object detection, and per-voxel motion flow estimation; supervision comes from warping nearby frames to the current frame and is constrained to visible surfaces. It runs at about $27$ fps for multiple objects. On a dedicated subtle-motion benchmark derived from Waymo, with ground-truth motion computed from 3D box annotations and a minimal motion threshold $f_{min} < 0.2$ m, the method shows substantial improvements in EPE, angular error, and F1 over ICP-based and scene-flow baselines. Ablation studies confirm the critical role of occupancy completion, and further analyses show strong performance on small motions with competitive results on large motions. The results highlight the practical importance of instantaneous subtle-motion perception for driving safety, though the approach currently focuses on vehicles and faces challenges under extreme sparsity or occlusion.
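As a rough illustration of two of the metrics named above, the sketch below computes end-point error (EPE) and angular error between predicted and ground-truth flow vectors. It assumes the common scene-flow definitions (mean Euclidean distance, and mean angle between flow directions); the benchmark's exact definitions may differ, and the function names and sample values are hypothetical.

```python
import numpy as np

def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow, shape (N, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def angular_error_deg(pred, gt, eps=1e-8):
    """Mean angle in degrees between predicted and ground-truth flow directions.

    Assumption: a zero-length vector on either side yields cos = 0, i.e. 90 degrees.
    """
    dot = np.sum(pred * gt, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(gt, axis=1)
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

# Hypothetical subtle motions (metres per frame), all below 0.2 m.
gt = np.array([[0.10, 0.00, 0.0], [0.00, 0.15, 0.0]])
pred = np.array([[0.10, 0.02, 0.0], [0.00, 0.12, 0.0]])

print(end_point_error(pred, gt))    # magnitude error, in metres
print(angular_error_deg(pred, gt))  # direction error, in degrees
```

For subtle motions the two metrics are complementary: a centimetre-level EPE can still hide a large directional error when the true displacement is itself only a few centimetres.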
Abstract
The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important, as they indicate nuances in driving behavior that may be safety critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse LiDAR point clouds, static objects might appear to be moving, the so-called swimming effect. This effect intertwines with the true object motion and makes accurate estimation ambiguous, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motions.
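The swimming effect can be reproduced in a toy setting. The sketch below samples the same static straight edge at two different sets of positions, as two sparse scans might, and matches points by nearest neighbor. Nearest-neighbor matching is an assumption standing in for ICP-style correspondence; it is not the method proposed here, and all names and numbers are hypothetical.

```python
import numpy as np

def nearest_neighbor_displacement(src, dst):
    """For each source point, return the vector to its nearest neighbor in dst."""
    diff = dst[None, :, :] - src[:, None, :]               # shape (N, M, 3)
    idx = np.argmin(np.linalg.norm(diff, axis=2), axis=1)  # closest dst index per src point
    return dst[idx] - src

def sample_edge(xs):
    """Points on a static edge lying along the x-axis (y = z = 0)."""
    return np.stack([xs, np.zeros_like(xs), np.zeros_like(xs)], axis=1)

# Frame t: the edge is hit at x = 0.0, 0.1, ..., 1.9 m.
x_t = np.arange(20) * 0.1
# Frame t+1: the object has NOT moved, but the sparse scan now hits the same
# edge at positions offset by 3 cm.
x_t1 = x_t + 0.03

apparent = nearest_neighbor_displacement(sample_edge(x_t), sample_edge(x_t1))
mean_apparent = float(np.linalg.norm(apparent, axis=1).mean())
print(mean_apparent)  # nonzero although the true motion is zero
```

The apparent displacement here comes entirely from the sampling pattern, and its magnitude is comparable to a genuinely subtle motion, which is why a denser shape cue helps disambiguate the two.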
