Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps

Gianluca Monaci, Leonid Antsfeld, Boris Chidlovskii, Christian Wolf

TL;DR

Zero-BEV projects any modality available in a first-person view (FPV) to a BEV map without requiring depth, including modalities never seen during training (zero-shot). It does so by disentangling the geometric FPV→BEV projection from the modality translation: synthetic data generation decorrelates scene content from texture, and a transformer-based cross-attention architecture maps FPV columns to BEV rays. The paper also explores a variant based on an inductive bias and a residual variant guided by the depth-based geometric solution. Experiments indicate that the approach outperforms competing zero-shot methods on semantic BEV maps and extends to other modalities such as motion vectors and bounding boxes. The resulting BEV representations do not depend on depth input, which makes them applicable to a range of tasks in robotics and autonomous systems.

Abstract

Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular in self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV representations, and are therefore restricted to the output modality they have been trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first-person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, e.g. RGB to occupancy. The method is general, and we showcase experiments projecting three different modalities to BEV: semantic segmentation, motion vectors and object bounding boxes detected in first person. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
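
To make the depth-based baseline mentioned in the abstract concrete, the following is a minimal NumPy sketch of an inverse perspective projection followed by pooling to the ground plane. It is an illustration under stated assumptions, not the authors' implementation: the function name `depth_to_bev`, the pinhole intrinsics `fx` and `cx`, the grid parameters `bev_size` and `cell`, and the choice of max-pooling over height are all hypothetical.

```python
import numpy as np

def depth_to_bev(modality, depth, fx, cx, bev_size=100, cell=0.05):
    """Project a per-pixel FPV modality map to BEV using depth.

    modality: (H, W) array of non-negative values (e.g. a binary class mask).
    depth:    (H, W) array of metric depth along the optical axis, in metres.
    fx, cx:   assumed pinhole intrinsics (focal length, principal point) in pixels.
    bev_size: assumed number of cells per side of the square BEV grid.
    cell:     assumed side length of one BEV cell, in metres.
    """
    H, W = depth.shape
    u = np.tile(np.arange(W), (H, 1))      # column index of every pixel
    z = depth                              # forward distance from the camera
    x = (u - cx) * z / fx                  # lateral offset (pinhole model)

    # BEV layout: agent at the bottom centre, rows encode forward distance,
    # columns encode lateral offset.
    col = np.floor(x / cell).astype(int) + bev_size // 2
    row = bev_size - 1 - np.floor(z / cell).astype(int)
    valid = (z > 0) & (row >= 0) & (row < bev_size) & (col >= 0) & (col < bev_size)

    bev = np.zeros((bev_size, bev_size), dtype=modality.dtype)
    # Pixels at different heights can land in the same cell: keep the maximum
    # (an assumed pooling choice; other reductions are possible).
    np.maximum.at(bev, (row[valid], col[valid]), modality[valid])
    return bev
```

In this sketch the quality of the BEV map depends entirely on `depth`: with a monocular depth estimate instead of ground truth, every depth error displaces the projected values on the ground plane.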

Paper Structure

This paper contains 19 sections, 1 theorem, 12 equations, 11 figures, 5 tables.

Key Result

Theorem 3.1

Let a neural network with cross-attention as in Eqs. (eq:twostream1)-(eq:twostream2) and defined column/ray-wise as in Eqs. (eq:attention)-(eq:attentionoutput) be trained on two streams of (FPV, BEV) pairs of different modalities, where pooling over the vertical dimension is done with a linear funct

Figures (11)

  • Figure 1: We train a model to project first-person views (FPVs) to BEV maps. (a) Existing work trains end-to-end the prediction of the target modality. (b) We disentangle two underlying transformations: ➀ the geometric projection from FPV to BEV, and ➁ an optional modality translation seen during training, e.g. RGB to occupancy. This enables zero-shot projection of any modality unseen during training at deployment, leaving the modality unchanged and performing the geometric transformation only.
  • Figure 2: Models: (a) geometric solution based on monocular depth estimation (MDE), inverse projection $\mathcal{P}^{-1}$ and pooling to the ground; (b) end-to-end training to predict a target modality (not zero-shot capable); (c) Zero-BEV model, including optional auxiliary supervision, with feature extractor $\psi$, transformer, and U-Net; (d) Zero-BEV Residual, featuring the geometric solution; (e) model using inductive bias for disentangling (Section \ref{ssec:inductivebias}).
  • Figure 3: Data generation: we project procedurally generated random textures onto the 3D scene structure and then render the textured mesh into image pairs (first-person view, bird's-eye view) with perspective and orthographic projection, respectively.
  • Figure 4: Causal properties of the data generation process: for each sample $i$, $\mathbf{I}^{rgb}$ depends on scene geometry $g$ and semantics $s$. Confounders between $\mathbf{I}^{rgb}$ and $\mathbf{M}^{zero}$ are shaded in orange. $\mathcal{MD}$ is a modality definition, resulting in texture $t$. (a) A fixed meaningful $\mathcal{MD}$ leads to undesired confounders between $\mathbf{I}^{rgb}$ and $\mathbf{M}^{zero}$. (b) We vary $\mathcal{MD}$ over samples and keep it independent of the scene properties $(s,g)$, so the only confounder is geometry $g$.
  • Figure 5: Qualitative results on HM3DSem test scenes. (a.1) uses ground-truth depth, and methods (c.2) and (c.3) are not zero-shot capable; they are thus not comparable. (a.2), (c) and (d) are zero-shot models; see Table \ref{tab:results_main}. Zero-BEV models produce significantly better BEV maps.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Theorem 3.1: Disentangling property
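
The idea behind the column/ray-wise cross-attention referred to in Theorem 3.1 and Figure 1(b) can be illustrated with a small NumPy sketch. It is not the paper's architecture: the function names, the tensor shapes and the random stand-ins for learned features and queries are hypothetical. The only point it shows is that attention weights computed from RGB alone can be applied unchanged to any other first-person modality, and that the result is linear in that modality.

```python
import numpy as np

def column_to_ray_attention(rgb_feat, queries):
    """Attention weights from BEV ray cells to the image rows of each FPV column.

    rgb_feat: (W, H, d) per-column, per-row features computed from RGB only
              (hypothetical stand-in for a learned feature extractor).
    queries:  (R, d) one query per cell along a BEV ray (hypothetical).
    Returns:  (W, R, H) weights; each [w, r, :] slice sums to 1 over image rows.
    """
    d = rgb_feat.shape[-1]
    logits = np.einsum("rd,whd->wrh", queries, rgb_feat) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def project_modality(weights, modality):
    """Apply the same weights to any FPV modality: (W, H, C) -> (W, R, C)."""
    return np.einsum("wrh,whc->wrc", weights, modality)

# Random placeholders instead of learned features and real images.
rng = np.random.default_rng(0)
W, H, R, d = 6, 8, 5, 4
A = column_to_ray_attention(rng.normal(size=(W, H, d)), rng.normal(size=(R, d)))

semantic = rng.random((W, H, 3))   # e.g. per-pixel class scores
motion = rng.random((W, H, 3))     # e.g. per-pixel motion vectors
bev_semantic = project_modality(A, semantic)
bev_motion = project_modality(A, motion)
```

The output here is indexed by (image column, ray cell), that is, in polar form; resampling the rays onto a Cartesian BEV grid is omitted from the sketch.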