Visual Point Cloud Forecasting enables Scalable Autonomous Driving

Zetong Yang, Li Chen, Yanan Sun, Hongyang Li

TL;DR

This work tackles the scarcity of scalable pre-training for visual autonomous driving by introducing visual point cloud forecasting as a pre-text task. It presents ViDAR, a three-component framework that learns semantic, geometric, and temporal representations from historical Image-LiDAR sequences to forecast future point clouds and BEV features, enabling effective pre-training of visual encoders. A key contribution is Latent Rendering, which overcomes the limitations of differentiable ray-casting by producing a discriminative 3D geometric latent space; it is complemented by an autoregressive Future Decoder. Empirical results on nuScenes show that ViDAR improves downstream 3D detection, semantic occupancy, motion forecasting, and open-loop planning, demonstrating strong data efficiency and end-to-end benefits for scalable visual autonomous driving.

Abstract

In contrast to extensive studies on general vision, pre-training for scalable visual autonomous driving remains seldom explored. Visual autonomous driving applications require features that encompass semantics, 3D geometry, and temporal information simultaneously for joint perception, prediction, and planning, which poses a substantial challenge for pre-training. To resolve this, we introduce a new pre-training task termed visual point cloud forecasting: predicting future point clouds from historical visual input. The key merit of this task is that it captures the synergic learning of semantics, 3D structures, and temporal dynamics, and hence it shows superiority in various downstream tasks. To cope with this new problem, we present ViDAR, a general model to pre-train downstream visual encoders. It first extracts historical embeddings with the encoder. These representations are then transformed into 3D geometric space via a novel Latent Rendering operator for future point cloud prediction. Experiments show significant gains in downstream tasks, e.g., 3.1% NDS on 3D detection, ~10% error reduction on motion forecasting, and ~15% lower collision rate on planning.

Paper Structure (29 sections, 12 equations, 12 figures, 14 tables)

Figures (12)

  • Figure 1: ViDAR is a visual autonomous driving pre-training framework that uses the estimation of future point clouds from historical visual inputs as the pre-text task. We term this new pre-text task visual point cloud forecasting. With the aid of ViDAR, we achieve substantial improvements across a diverse spectrum of downstream applications for perception, prediction, and planning.
  • Figure 2: Comparisons among visual autonomous driving pre-training paradigms and our ViDAR architecture. Compared to existing methods, visual point cloud forecasting jointly models multi-view geometry and temporal dynamics. We then propose ViDAR, using Image-LiDAR sequences to pre-train visual encoders.
  • Figure 3: Ray-shaped Features vs. Geometric Features. Ray-shaped features show similar feature responses on BEV grids along the same ray, while geometric features from Latent Rendering maintain discriminative 3D geometry and can describe the 3D world in latent space.
  • Figure 4: Multi-group Latent Rendering comprises several Latent Rendering operators running in parallel on different channels. Latent Rendering captures geometric features through the conditional probability function and the feature extraction function. "$\bigoplus$" denotes concatenating the multi-group features along the channel dimension.
  • Figure 5: The Future Decoder iteratively predicts the next BEV features, $\hat{\mathcal{F}}_t$, conditioned on the ego-motion $\mathbf{e}_{t}$ and the last BEV features, which enables specific future predictions under any ego-control.
  • ...and 7 more figures
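
The captions for Figures 3 and 4 describe Latent Rendering only in words, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes the conditional probability function is a sigmoid over a linear projection of each BEV grid feature, and that probabilities are accumulated along a ray in the standard volume-rendering style. The names `latent_render_ray`, `multi_group_latent_render`, and the per-group weight vectors are hypothetical and introduced here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def latent_render_ray(features, weights):
    """Render one ray of BEV grid features (ordered near to far).

    features: list of N feature vectors, each of length C.
    weights:  length-C vector of a hypothetical linear head (assumption).
    Returns (geometric_features, hit_probabilities).
    """
    # Conditional probability that the ray stops at grid i, given it reached i.
    cond = [sigmoid(sum(w * f for w, f in zip(weights, feat))) for feat in features]

    # Probability that the ray stops exactly at grid i:
    # cond[i] times the probability of passing through all earlier grids.
    hit, survive = [], 1.0
    for p in cond:
        hit.append(survive * p)
        survive *= 1.0 - p

    # Ray-wise feature: expectation of the grid features under `hit`.
    channels = len(features[0])
    ray_feat = [
        sum(h * feat[c] for h, feat in zip(hit, features)) for c in range(channels)
    ]

    # Geometric feature of grid i: the ray feature scaled by hit[i], so grids
    # on the same ray no longer share near-identical responses.
    geometric = [[h * r for r in ray_feat] for h in hit]
    return geometric, hit


def multi_group_latent_render(features, group_weights):
    """Split channels into groups, render each group independently,
    and concatenate the results along the channel dimension."""
    groups = len(group_weights)
    channels = len(features[0])
    assert channels % groups == 0, "channels must divide evenly into groups"
    size = channels // groups

    out = [[] for _ in features]
    for g, w in enumerate(group_weights):
        chunk = [feat[g * size:(g + 1) * size] for feat in features]
        rendered, _ = latent_render_ray(chunk, w)
        for i, vec in enumerate(rendered):
            out[i].extend(vec)  # channel-wise concatenation
    return out


if __name__ == "__main__":
    ray = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 2.0]]
    rendered = multi_group_latent_render(ray, [[0.0, 0.0], [0.0, 0.0]])
    for grid in rendered:
        print(grid)
```

In this sketch, each group has its own probability head, so different channel groups can place their mass at different depths along the same ray; that is the intuition behind running several renderers in parallel, under the assumptions stated above.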