
Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors

Hiroki Hashimoto, Hiromichi Goto, Hiroyuki Sugai, Hiroshi Kera, Kazuhiko Kawamoto

Abstract

Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.
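
The integration described in the abstract can be illustrated with a minimal sketch, not the authors' implementation: per-pixel depth from the 3D foundation model (DA3 in the figures) is unprojected to 3D positions using the camera parameters, encoded by a small MLP into positional embeddings that are added to the image features, and the model's intermediate geometric features are fused through cross-attention. All module and tensor names below (`GeometricPriorFusion`, `K_inv`, `cam2ego`, etc.) are hypothetical, and the shapes assume flattened per-pixel feature maps.

```python
# Minimal sketch (assumed, not the paper's code). PyTorch-style; depth and geo_feat
# are assumed to come from a 3D foundation model such as DA3.
import torch
import torch.nn as nn

class GeometricPriorFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Maps per-pixel 3D positions (x, y, z) to positional embeddings.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Cross-attention: image features (queries) attend to geometric features (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def unproject(self, depth, K_inv, cam2ego):
        # depth: (B, H, W); K_inv: (B, 3, 3); cam2ego: (B, 4, 4) camera-to-ego transform.
        B, H, W = depth.shape
        v, u = torch.meshgrid(
            torch.arange(H, device=depth.device, dtype=depth.dtype),
            torch.arange(W, device=depth.device, dtype=depth.dtype),
            indexing="ij",
        )
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (HW, 3)
        rays = torch.einsum("bij,nj->bni", K_inv, pix)                        # camera rays
        pts_cam = rays * depth.reshape(B, -1, 1)                              # camera frame
        pts = torch.einsum("bij,bnj->bni", cam2ego[:, :3, :3], pts_cam)
        return pts + cam2ego[:, None, :3, 3]                                  # ego frame

    def forward(self, img_feat, geo_feat, depth, K_inv, cam2ego):
        # img_feat, geo_feat: (B, HW, C)
        pos = self.unproject(depth, K_inv, cam2ego)      # per-pixel 3D positions, (B, HW, 3)
        img_feat = img_feat + self.pos_mlp(pos)          # inject 3D positional embeddings
        fused, _ = self.cross_attn(img_feat, geo_feat, geo_feat)
        return img_feat + fused                          # residual fusion of geometric features
```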

Paper Structure

This paper contains 20 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed method. We extract geometric features and depth estimates from a 3D foundation model and integrate them into an end-to-end autonomous driving model to improve robustness against camera viewpoint changes.
  • Figure 2: Architecture of the proposed method. (a) 3D Spatial Encoder: computes depth-derived 3D positions from DA3 and camera parameters, and injects them as positional embeddings into image features. (b) Geometric Feature Fusion: fuses DA3 intermediate features into image features via cross-attention. A standard unprojection that could underlie (a) is sketched after this list.
  • Figure 3: Camera images under different viewpoint perturbation conditions. Each column shows the Original, Pitch $-10^\circ$, and Depth $+1.0$ m conditions for the front and front-left cameras.
  • Figure 4: BEV trajectory comparison under viewpoint perturbations. Rows correspond to Height $+1.0$ m and Depth $+1.0$ m conditions. Under Height perturbation, the proposed method remains close to the ground truth, whereas World4Drive shows a larger deviation. Under Depth perturbation, both methods fail.
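
The 3D Spatial Encoder in Figure 2(a) can be read as a standard pinhole unprojection; the paper's exact formulation is not reproduced here, so the following is only an assumed sketch. A pixel $(u, v)$ with estimated depth $d_{uv}$, camera intrinsics $K$, and camera-to-ego rotation and translation $(R, \mathbf{t})$ yields the per-pixel 3D position

$$\mathbf{p}_{uv} = R \, d_{uv} \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} + \mathbf{t},$$

which is the quantity consumed by the positional-embedding MLP in the code sketch above.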