Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation

Jiawei Zhao, Qixing Jiang, Xuede Li, Junfeng Luo

TL;DR

A novel FocusBEV framework consisting of a self-calibrated cross view transformation module to suppress the BEV-agnostic image areas and focus on the BEV-relevant areas in the view transformation stage, and a plug-and-play ego-motion-based temporal fusion module to exploit the spatiotemporal structure consistency in BEV space with a memory bank.

Abstract

Birds-Eye-View (BEV) segmentation aims to establish a spatial mapping from the perspective view to the top view and estimate the semantic maps from monocular images. Recent studies have encountered difficulties in view transformation due to the disruption of BEV-agnostic features in image space. To tackle this issue, we propose a novel FocusBEV framework consisting of (i) a self-calibrated cross view transformation module to suppress the BEV-agnostic image areas and focus on the BEV-relevant areas in the view transformation stage, (ii) a plug-and-play ego-motion-based temporal fusion module to exploit the spatiotemporal structure consistency in BEV space with a memory bank, and (iii) an occupancy-agnostic IoU loss to mitigate both semantic and positional uncertainties. Experimental evidence demonstrates that our approach achieves a new state of the art on two popular benchmarks, i.e., 29.2% mIoU on nuScenes and 35.2% mIoU on Argoverse.

Paper Structure

This paper contains 13 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Motivation of the proposed self-calibrated cycle view transformation. (a) Initial PV-BEV transformation is disrupted by BEV-agnostic areas (e.g., sky and buildings), whereas (b) the cycle view transformation could suppress these BEV-agnostic areas and concentrate on BEV-relevant areas in a self-calibrated manner.
  • Figure 2: The overall pipeline of our proposed FocusBEV framework. The backbone and FPN extract PV features. The cycle view transformation then transforms the spatial features of PV space to the BEV space using cyclical mapping. The BEV temporal fusion aligns the history BEV features and aggregates them spatiotemporally with the current frame. The top-down network upsamples BEV features to predict a semantic occupancy map.
  • Figure 3: Our proposed self-calibrated cycle view transformation module consists of two stages: the PV-BEV transformation and the cycle view transformation.
  • Figure 4: Our proposed ego-motion-based temporal fusion module. The history BEV features are aligned with the reference frame using ego-motion. These features are then stacked and aggregated to enhance the spatiotemporal contextual information of the reference frame.
  • Figure 5: Qualitative results on the nuScenes dataset.
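
The alignment step described in the Figure 4 caption (history BEV features are aligned to the reference frame using ego-motion, then aggregated) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `warp_bev` and `fuse`, the grid convention (ego vehicle at the grid centre, `res` metres per cell), nearest-neighbour sampling, and the masked mean used for aggregation are all assumptions made for illustration; the paper's module stacks the aligned features and aggregates them with its own fusion network.

```python
import numpy as np

def warp_bev(feat, yaw, t, res=1.0):
    """Warp history BEV features of shape (C, H, W) into the reference frame.

    Assumption: (yaw, t) is the 2D rigid ego-motion mapping a point from the
    history ego frame to the reference ego frame, p_ref = R(yaw) @ p_hist + t,
    with t in metres and the ego vehicle at the grid centre.
    """
    C, H, W = feat.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Metric coordinates of every cell centre in the reference grid.
    x_ref = (jj - (W - 1) / 2) * res
    y_ref = (ii - (H - 1) / 2) * res
    # Inverse transform: p_hist = R^T (p_ref - t).
    c, s = np.cos(yaw), np.sin(yaw)
    dx, dy = x_ref - t[0], y_ref - t[1]
    x_hist = c * dx + s * dy
    y_hist = -s * dx + c * dy
    # Nearest-neighbour lookup in the history grid.
    j_hist = np.rint(x_hist / res + (W - 1) / 2).astype(int)
    i_hist = np.rint(y_hist / res + (H - 1) / 2).astype(int)
    valid = (i_hist >= 0) & (i_hist < H) & (j_hist >= 0) & (j_hist < W)
    out = np.zeros_like(feat)
    out[:, valid] = feat[:, i_hist[valid], j_hist[valid]]
    return out, valid

def fuse(current, history, motions, res=1.0):
    """Average the current BEV features with aligned history features.

    Assumption: a masked mean stands in for the learned aggregation.
    """
    total = current.astype(float).copy()
    count = np.ones(current.shape[1:])
    for feat, (yaw, t) in zip(history, motions):
        warped, valid = warp_bev(feat, yaw, t, res)
        total += warped   # cells outside the history grid contribute zero
        count += valid    # and are excluded from the per-cell average
    return total / count

# Usage: one history frame whose content lies 1 m further along x in the
# reference frame.
hist = np.zeros((1, 5, 5))
hist[0, 2, 1] = 1.0
fused = fuse(np.zeros((1, 5, 5)), [hist], [(0.0, (1.0, 0.0))])
```

The validity mask keeps cells that fall outside the history grid from diluting the average, which matters near the grid border where the ego vehicle has moved the most.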