Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals

Shaoheng Fang, Zuhong Liu, Mingyu Wang, Chenxin Xu, Yiqi Zhong, Siheng Chen

TL;DR

The paper tackles dense BEV motion prediction for autonomous driving under minimal supervision. It proposes a cross-modality self-supervised framework that uses sequential camera images to supervise LiDAR-based BEV motion learning. Three supervision signals are introduced: a masked Chamfer distance derived from a pseudo static/dynamic mask, a piecewise rigidity loss obtained via image-space over-segmentation and cross-view projection, and a temporal consistency loss across frames; the overall loss combines these terms. The approach achieves state-of-the-art performance among self-supervised methods on nuScenes, with substantial gains over prior self-supervised work and results competitive with weakly and fully supervised baselines, and it generalizes to Argoverse2. Inference remains efficient because only point-cloud sequences are required at test time, which makes the method practical for real-world deployment and for data-efficient learning from unlabeled data.

Abstract

Learning the dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research topic in robotics and autonomous driving. Current self-supervised methods mainly rely on point correspondences between point clouds, which may introduce the problems of fake flow and inconsistency, hindering the model's ability to learn accurate and realistic motion. In this paper, we introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data to obtain supervision signals. We design three innovative supervision signals to preserve the inherent properties of scene motion: the masked Chamfer distance loss, the piecewise rigidity loss, and the temporal consistency loss. Through extensive experiments, we demonstrate that our proposed self-supervised framework outperforms all previous self-supervised methods on the motion prediction task.
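
To make the masked Chamfer distance concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that points are 2D BEV coordinates, that the pseudo static/dynamic mask is a list of booleans, that static points are simply excluded from the distance, and that the overall loss is a weighted sum. The function names, the weights `w_rigid` and `w_temp`, and the handling of empty sets are all hypothetical.

```python
import math


def masked_chamfer(points, flows, dynamic_mask, next_points, next_dynamic_mask):
    """Symmetric Chamfer distance between flow-warped points and the next frame,
    restricted to points flagged as dynamic (assumed masking rule)."""
    # Warp only the dynamic points of the current frame by their predicted flow.
    warped = [
        (x + dx, y + dy)
        for (x, y), (dx, dy), keep in zip(points, flows, dynamic_mask)
        if keep
    ]
    # Keep only the dynamic points of the next frame as targets.
    targets = [p for p, keep in zip(next_points, next_dynamic_mask) if keep]
    if not warped or not targets:
        return 0.0  # assumption: no dynamic points means no Chamfer penalty

    def one_way(a, b):
        # Mean distance from each point in a to its nearest neighbour in b.
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

    return one_way(warped, targets) + one_way(targets, warped)


def total_loss(chamfer, rigidity, temporal, w_rigid=1.0, w_temp=1.0):
    """Weighted sum of the three terms (the weighting scheme is an assumption)."""
    return chamfer + w_rigid * rigidity + w_temp * temporal


# One moving point and one static building point that is re-sampled 0.3 m away.
points = [(0.0, 0.0), (10.0, 10.0)]
flows = [(1.0, 0.0), (0.0, 0.0)]
next_points = [(1.0, 0.0), (10.3, 10.0)]
mask = [True, False]

masked = masked_chamfer(points, flows, mask, next_points, mask)
unmasked = masked_chamfer(points, flows, [True, True], next_points, [True, True])
```

In this toy case the unmasked distance is non-zero even though the predicted flow is correct, because the static point is sampled at a slightly different location in the next frame. This is the fake-flow problem the abstract describes; masking out the static point removes that spurious penalty.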

Paper Structure (23 sections, 17 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Problems in current self-supervised motion learning methods that rely on point correspondence. (a) For static objects (e.g., a background building), corresponding points in the point cloud sequence may have completely different locations, which misleads the model into learning fake flow. (b) Due to the sparse nature of the point cloud, points within an instance may learn highly varying flow.
  • Figure 2: An overview of our cross-modality self-supervision learning framework. For self-supervised training, we introduce three innovative self-supervised losses that align with real-world motion patterns. The inference process only takes the point cloud sequence as input and predicts the motion flow of each BEV cell (grey area).
  • Figure 3: Rigid piece generation. (a) A frame of sequential images; (b) Over-segmentation on the optical flow image; (c) Over-segmentation projected to the associated point cloud; (d) Rigid pieces after fusion. In (c) and (d), each color refers to a piece.
  • Figure 4: An example of the generated static/dynamic mask and the rigid piece labels. Left: green represents dynamic points while black represents static points; Right: each color except black refers to a rigid piece label.
  • Figure 5: Visualizations of the pseudo static/dynamic mask and rigid piece labels. A good case.
  • ...and 3 more figures
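
The rigid piece labels described in Figures 3 and 4 suggest how a piecewise rigidity loss could be computed. The sketch below is an illustration under simplifying assumptions, not the paper's formulation: each piece is treated as sharing a single BEV translation, so the penalty is the mean squared deviation of each flow vector from its piece's mean flow. The function name and the `ignore_label` value for points without a piece (the black points in Figure 4) are hypothetical.

```python
from collections import defaultdict


def piecewise_rigidity(flows, piece_labels, ignore_label=-1):
    """Mean squared deviation of each flow vector from the mean flow of its piece.

    Assumes translation-only motion per piece; points carrying ignore_label
    belong to no piece and are skipped.
    """
    groups = defaultdict(list)
    for flow, label in zip(flows, piece_labels):
        if label != ignore_label:
            groups[label].append(flow)

    total, count = 0.0, 0
    for members in groups.values():
        n = len(members)
        mean_x = sum(f[0] for f in members) / n
        mean_y = sum(f[1] for f in members) / n
        for fx, fy in members:
            total += (fx - mean_x) ** 2 + (fy - mean_y) ** 2
            count += 1
    return total / count if count else 0.0


# Piece 0 moves consistently; piece 1 has inconsistent flow; the last point is unlabeled.
flows = [(1.0, 0.0), (1.0, 0.0), (0.0, 2.0), (0.0, 4.0), (9.0, 9.0)]
labels = [0, 0, 1, 1, -1]
loss = piecewise_rigidity(flows, labels)
```

Only the inconsistent piece contributes to the loss, which matches the motivation in Figure 1(b): points within one instance should not learn highly varying flow.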