RendBEV: Semantic Novel View Synthesis for Self-Supervised Bird's Eye View Segmentation

Henrique Piñeiro Monteagudo, Leonardo Taccari, Aurel Pjetri, Francesco Sambo, Samuele Salti

TL;DR

RendBEV addresses BEV semantic segmentation when BEV annotations are scarce by enabling self-supervised training through differentiable volumetric rendering. Starting from the BEV prediction for a reference frame, it renders perspective-view semantics for other frames using a frozen neural density field, and learns via a cross-entropy loss against target perspective semantics computed by a 2D semantic segmentation model. The method is architecture-agnostic and benefits from temporal supervision. It achieves strong zero-shot performance on KITTI-360, substantial gains as a pretraining step in low-annotation regimes, and state-of-the-art results when fine-tuned on all available labels. The work also outlines future directions for handling dynamic objects and multi-camera setups.

Abstract

Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention as a useful representation of the environment to tackle assisted and autonomous driving tasks. However, most of the existing work focuses on the fully supervised setting, training networks on large annotated datasets. In this work, we present RendBEV, a new method for the self-supervised training of BEV semantic segmentation networks, leveraging differentiable volumetric rendering to receive supervision from semantic perspective views computed by a 2D semantic segmentation model. Our method enables zero-shot BEV semantic segmentation, and already delivers competitive results in this challenging setting. When used as pretraining to then fine-tune on labeled BEV ground-truth, our method significantly boosts performance in low-annotation regimes, and sets a new state of the art when fine-tuning on all available labels.

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: RendBEV performs self-supervised training of any BEV semantic segmentation model via volumetric rendering. It enables training BEV semantic segmentation architectures in the absence of any labeled or pseudolabeled BEV data and provides state-of-the-art performance in the low-annotation regime when used as a pretraining strategy.
  • Figure 2: RendBEV, our method for self-supervised training of BEV semantic segmentation models: we perform a forward pass with a reference view $I^r$ as input to the BEV network. We render the semantic segmentation of another view $\hat{S}^k$, with class probability values $l^k_{\mathbf{x}_i}$ sampled from the BEV prediction $\hat{B}^r$ and densities $\sigma_{\mathbf{x}_i}$ queried from a pretrained frozen model $\omega$ that receives the target frame $I^k$ as input. We supervise the network with a cross-entropy loss between the rendered semantic segmentation $\hat{S}^k$ and the target semantic segmentation $S^k$.
  • Figure 3: Importance of rendering frames that are temporally far from the reference frame: by reconstructing future frames, we supervise areas that are occluded in the reference frame and provide denser supervision in spatially distant areas, which would otherwise be supervised by only a very small number of pixels in the reference frame.
  • Figure 4: Effect of the number of patches per sequence on the mIoU. The marker size is proportional to the time per iteration during training.
  • Figure 5: Qualitative results of our model under different annotation regimes, alongside the ground-truth (GT) BEV semantic segmentation.
  • ...and 2 more figures
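
The rendering step described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the standard volumetric rendering quadrature (per-sample opacity from density, accumulated transmittance along the ray), a nearest-cell lookup into the BEV grid that ignores height, and a BEV grid spanning the hypothetical metric ranges `x_range` and `z_range`. The function names `sample_bev`, `render_semantics`, and `cross_entropy` are invented for this example.

```python
import numpy as np

def sample_bev(bev_probs, points, x_range, z_range):
    """Look up class probabilities for 3D points in a BEV grid.

    bev_probs: (H, W, C) per-cell class probabilities (the BEV prediction).
    points:    (R, N, 3) ray samples as (x, y, z); height y is ignored (assumption).
    x_range, z_range: (min, max) metric extent of the grid (hypothetical parameters).
    Points outside the grid are clamped to the border cell (a simplification).
    """
    H, W, _ = bev_probs.shape
    u = (points[..., 0] - x_range[0]) / (x_range[1] - x_range[0]) * W
    v = (points[..., 2] - z_range[0]) / (z_range[1] - z_range[0]) * H
    col = np.clip(np.floor(u).astype(int), 0, W - 1)
    row = np.clip(np.floor(v).astype(int), 0, H - 1)
    return bev_probs[row, col]  # (R, N, C)

def render_semantics(class_probs, densities, deltas):
    """Composite per-sample class probabilities along each ray.

    class_probs: (R, N, C) probabilities sampled from the BEV prediction.
    densities:   (R, N) non-negative densities from the frozen density model.
    deltas:      (R, N) distances between consecutive samples on each ray.
    """
    alpha = 1.0 - np.exp(-densities * deltas)  # opacity of each sample
    # Transmittance: fraction of the ray that reaches sample i unabsorbed.
    trans = np.cumprod(1.0 - alpha, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    weights = trans * alpha  # (R, N)
    rendered = (weights[..., None] * class_probs).sum(axis=1)  # (R, C)
    return rendered, weights

def cross_entropy(rendered, target, eps=1e-8):
    """Mean cross-entropy between rendered class scores and integer target labels."""
    # Renormalise, since the weights of a ray may sum to less than one.
    probs = rendered / (rendered.sum(axis=1, keepdims=True) + eps)
    picked = probs[np.arange(len(target)), target]
    return float(-np.log(picked + eps).mean())
```

In a training loop, the densities would come from the frozen model and only the BEV network producing `bev_probs` would receive gradients; an automatic-differentiation framework would replace NumPy for that purpose.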