Visual Implicit Geometry Transformer for Autonomous Driving

Arsenii Shirokov, Mikhail Kuznetsov, Danila Stepochkin, Egor Evdokimov, Daniil Glazkov, Nikolay Patakin, Anton Konushin, Dmitry Senushkin

TL;DR

ViGT tackles the challenge of obtaining metric-scale 3D geometry from monocular surround-view camera rigs in autonomous driving. It introduces a calibration-free, transformer-based pipeline that learns a continuous 3D occupancy field in BEV by fusing multi-view features through an implicit projection and a query-based decoder, trained with self-supervision from synchronized image-LiDAR data. The approach supports rendering into multiple geometric representations (point clouds, occupancy fields, voxel grids). Trained on a mixture of five driving datasets, it achieves state-of-the-art pointmap estimation with the best average rank among the evaluated baselines, and performance comparable to supervised methods on Occ3D-nuScenes occupancy. This work highlights the potential of calibration-free, end-to-end geometric models as foundations for scalable and generalizable autonomous driving perception systems.

Abstract

We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in bird's-eye view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (nuScenes, Waymo, nuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.

Paper Structure (38 sections, 4 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Our Visual Implicit Geometry Transformer outperforms the most recent foundational geometric models on publicly available autonomous driving datasets in Chamfer Distance (lower is better). Positions closer to the center indicate better performance.
  • Figure 2: Our architecture consists of three main components: (1) an image encoder (ViT-L) that independently processes each image and extracts feature tokens from the last four layers, producing four sequences of tokens per image; (2) a calibration-free Implicit BEV Projection module that projects tokens from each encoder layer across all images to their corresponding BEV space, generating four layer-specific BEV representations, which are then aggregated and upsampled into a single unified BEV representation using DPT; and (3) a query-based Implicit Decoder that predicts occupancy probabilities for 3D points from the final BEV features. This design enables pure data-driven scene modeling without geometric inductive biases.
  • Figure 3: Cross-attention visualization demonstrating consistent correspondences between BEV queries and camera tokens. (Left) Image attention heatmaps for selected BEV queries, (Right) BEV attention heatmaps for selected image queries. These visualizations confirm that the implicit BEV projection learns geometrically correct camera-to-BEV transformations.
  • Figure 4: We construct two sets of training points sampled along each LiDAR ray. Points before the reflection point are labeled as free space (negative label), while points near the reflection point correspond to occupied space (positive label).
  • Figure 5: The camera-to-BEV attention responses (colored regions) accurately partition BEV space according to each camera's field of view, demonstrating that the implicit BEV projection correctly learns to map image features to their corresponding spatial sectors in BEV space without explicit camera parameters. Shown for different camera rig configurations.
  • ...and 8 more figures
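
The labeling scheme described in the Figure 4 caption (free-space points before the LiDAR reflection point, occupied points near it) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sample_ray_points`, the sample counts, the safety `margin` before the hit, and the `band` around the hit are all hypothetical choices made for illustration, and the paper's actual sampling distribution is not stated in this summary.

```python
import numpy as np

def sample_ray_points(origin, hits, n_free=4, n_occ=2, margin=0.5, band=0.1, rng=None):
    """Sample labeled training points along LiDAR rays (illustrative sketch).

    origin: (3,) sensor position; hits: (N, 3) LiDAR reflection points.
    All keyword defaults are assumptions, not values from the paper.
    Returns points of shape (N * (n_free + n_occ), 3) and labels in {0, 1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    origin = np.asarray(origin, dtype=np.float64)
    hits = np.asarray(hits, dtype=np.float64)

    vec = hits - origin                                   # (N, 3) ray vectors
    dist = np.linalg.norm(vec, axis=1, keepdims=True)     # (N, 1) range of each hit
    dirs = vec / np.maximum(dist, 1e-9)                   # (N, 3) unit directions

    # Free space: depths drawn uniformly between the sensor and (hit - margin).
    free_max = np.maximum(dist - margin, 0.0)
    t_free = rng.uniform(0.0, 1.0, size=(len(hits), n_free)) * free_max

    # Occupied space: depths drawn within +/- band of the reflection point.
    t_occ = dist + rng.uniform(-band, band, size=(len(hits), n_occ))

    t = np.concatenate([t_free, t_occ], axis=1)           # (N, K) depths along each ray
    points = origin + t[..., None] * dirs[:, None, :]     # (N, K, 3) 3D sample positions
    labels = np.concatenate(
        [np.zeros_like(t_free), np.ones_like(t_occ)], axis=1
    )                                                     # 0 = free, 1 = occupied
    return points.reshape(-1, 3), labels.reshape(-1)
```

Points and labels produced this way could serve as query/target pairs for a binary occupancy loss on a query-based decoder's predictions; the specific loss used by ViGT is not given in this summary.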