Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals
Shaoheng Fang, Zuhong Liu, Mingyu Wang, Chenxin Xu, Yiqi Zhong, Siheng Chen
TL;DR
The paper tackles dense BEV motion prediction for autonomous driving without motion annotations. It proposes a cross-modality self-supervised framework in which sequential camera images supervise LiDAR-based BEV motion learning. Three supervision signals are introduced: a masked Chamfer distance loss derived from a pseudo static/dynamic mask, a piecewise rigidity loss obtained via image-space over-segmentation and cross-view projection, and a temporal consistency loss across frames. The overall training loss combines these three terms. The approach achieves state-of-the-art performance among self-supervised methods on nuScenes, with substantial gains over prior self-supervised methods and results competitive with weakly and fully supervised baselines, and it generalizes to Argoverse2. Only point-cloud sequences are required at test time, so the camera branch adds no inference cost and the model can be trained on unlabeled data.
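As a rough illustration of how the three terms could be combined, one common choice is a weighted sum. The form below is an assumption made for illustration only: the symbols $\mathcal{L}_{\text{mcd}}$, $\mathcal{L}_{\text{rigid}}$, $\mathcal{L}_{\text{temp}}$ and the weights $\lambda_{1}$, $\lambda_{2}$ are hypothetical names, and the summary above does not state the actual combination or weights.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{mcd}}
  + \lambda_{1}\,\mathcal{L}_{\text{rigid}}
  + \lambda_{2}\,\mathcal{L}_{\text{temp}}
```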
Abstract
Learning dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research topic in robotics and autonomous driving. Current self-supervised methods rely mainly on point correspondences between point clouds, which may introduce the problems of fake flow and inconsistency, hindering the model's ability to learn accurate and realistic motion. In this paper, we introduce a novel cross-modality self-supervised training framework that addresses these issues by leveraging multi-modality data to obtain supervision signals. We design three supervision signals to preserve the inherent properties of scene motion: the masked Chamfer distance loss, the piecewise rigidity loss, and the temporal consistency loss. Through extensive experiments, we demonstrate that our proposed self-supervised framework outperforms all previous self-supervised methods on the motion prediction task.
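To make the first supervision signal more concrete, the following is a minimal NumPy sketch of one plausible reading of a masked Chamfer distance: points from the current frame are shifted by their predicted motion, and a symmetric Chamfer distance to the next frame is computed over points flagged as dynamic only. The function name, argument names, the use of 2D BEV coordinates, and the way the mask is applied are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def masked_chamfer_distance(warped_src, target, src_dynamic, tgt_dynamic):
    """Symmetric Chamfer distance restricted to points flagged as dynamic.

    warped_src  : (N, 2) BEV points of frame t after adding predicted motion.
    target      : (M, 2) BEV points of frame t+1.
    src_dynamic : (N,) boolean pseudo mask, True for dynamic points (assumed input).
    tgt_dynamic : (M,) boolean pseudo mask, True for dynamic points (assumed input).
    """
    src = warped_src[src_dynamic]
    tgt = target[tgt_dynamic]

    # With no dynamic points on either side there is nothing to match.
    if len(src) == 0 or len(tgt) == 0:
        return 0.0

    # Pairwise Euclidean distances, shape (n_src, n_tgt).
    dists = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)

    # Nearest-neighbour distance in both directions, averaged per direction.
    src_to_tgt = dists.min(axis=1).mean()
    tgt_to_src = dists.min(axis=0).mean()
    return float(src_to_tgt + tgt_to_src)
```

Restricting the distance to dynamic points reflects the idea that static points should not be pulled toward spurious nearest neighbours, which is one way the "fake flow" problem mentioned above can arise. A differentiable version for training would use the same logic in an autodiff framework.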
