Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation

Guanglei Yang, Yongqiang Zhang, Wanlong Li, Yu Tang, Weize Shang, Feng Wen, Hongbo Zhang, Mingli Ding

TL;DR

Geo-ConvGRU introduces a ConvGRU-based temporal module for Bird's-Eye View segmentation and augments it with a geographical mask that accounts for camera visibility. By replacing 3D temporal convolutions and injecting geometry-aware masking, the approach captures long-range temporal dependencies efficiently while suppressing noise from moving objects. The method achieves state-of-the-art results on NuScenes across BEV semantic segmentation, perceived maps, and future instance segmentation, highlighting improved accuracy with a favorable efficiency profile compared to Transformer-based temporal models. This work demonstrates a practical, geometry-informed temporal fusion strategy for real-time autonomous driving perception in BEV space.

Abstract

Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks; however, they inherently struggle to model long-range dependencies explicitly due to the localized nature of convolution operations. Although Transformers have addressed limitations in long-range dependencies for the spatial dimension, the temporal dimension remains underexplored. In this paper, we first highlight that 3D CNNs exhibit limitations in capturing long-range temporal dependencies. Though Transformers mitigate spatial dimension issues, they considerably increase the parameter count and reduce processing speed. To overcome these challenges, we introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation. Specifically, we substitute the 3D CNN layers with ConvGRU in the temporal module to bolster the capacity of networks for handling temporal dependencies. Additionally, we integrate a geographical mask into the Convolutional Gated Recurrent Unit to suppress noise introduced by the temporal module. Comprehensive experiments conducted on the NuScenes dataset substantiate the merits of the proposed Geo-ConvGRU, revealing that our approach attains state-of-the-art performance in Bird's-Eye View segmentation.
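
To make the idea of a masked recurrent temporal module concrete, here is a minimal NumPy sketch of one ConvGRU step followed by an element-wise mask. It is an illustration under stated assumptions, not the authors' implementation. The function names, the 3×3 kernel size, the gate convention `(1 - z) * h + z * candidate`, and in particular the place where the mask is applied (on the updated hidden state) are all assumptions chosen for clarity. The abstract only says that a geographical mask is integrated into the ConvGRU.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv2d_same(x, w):
    """Zero-padded 'same' 2D cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    # win has shape (C_in, H, W, k, k): one k x k patch per output cell
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", win, w)

def geo_convgru_step(x, h, mask, w_z, w_r, w_h):
    """One recurrent step on BEV features (hypothetical helper).
    x: (C_in, H, W) features of the current frame
    h: (C_h, H, W) hidden state from the previous frame
    mask: (H, W) with 1 = visible cell, 0 = cell to suppress
    w_*: (C_h, C_in + C_h, k, k) convolution kernels"""
    xh = np.concatenate([x, h], axis=0)
    z = sigmoid(conv2d_same(xh, w_z))            # update gate
    r = sigmoid(conv2d_same(xh, w_r))            # reset gate
    cand_in = np.concatenate([x, r * h], axis=0)
    cand = np.tanh(conv2d_same(cand_in, w_h))    # candidate state
    h_new = (1.0 - z) * h + z * cand             # (1 - z) is the NOT of the gate
    # ASSUMPTION: the mask is applied to the updated state; it broadcasts over channels
    return mask * h_new

def run_sequence(frames, masks, w_z, w_r, w_h, hidden_channels):
    """Fold a list of per-frame BEV features into one hidden state."""
    _, height, width = frames[0].shape
    h = np.zeros((hidden_channels, height, width))
    for x, m in zip(frames, masks):
        h = geo_convgru_step(x, h, m, w_z, w_r, w_h)
    return h
```

Because the hidden state is carried from frame to frame, the number of parameters does not grow with the number of past frames, which is the usual motivation for a recurrent unit over a stack of 3D convolutions.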

Paper Structure

This paper contains 12 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Performance (IoU) vs. efficiency (training time) for different temporal fields on bird's-eye view semantic segmentation.
  • Figure 2: Overview of a model for BEV segmentation.
  • Figure 3: An example of a ConvGRU unit. $\neg$ denotes the NOT operation.
  • Figure 4: An example of our Geo-ConvGRU module. $\mathcal{M}_{geo}$ denotes the geographical mask and $\otimes$ denotes the element-wise product.
  • Figure 5: Qualitative results on BEV semantic segmentation. The BEV grid covers 100m × 100m at 50cm resolution. For easier viewing, cars are marked at the instance level.
  • ...and 3 more figures
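
As a quick check on the grid setting quoted in the Figure 5 caption, a 100m × 100m area at 50cm per cell corresponds to a 200 × 200 BEV grid. The short sketch below works this out and maps a metric position to a cell index. The helper names, the row/column ordering, and the assumption that the ego vehicle sits at the grid center are illustrative choices, not details taken from the paper.

```python
import math

def bev_grid_shape(extent_m=(100.0, 100.0), cell_m=0.5):
    """Number of cells along each axis for a given metric extent."""
    return tuple(int(round(e / cell_m)) for e in extent_m)

def world_to_cell(x_m, y_m, extent_m=(100.0, 100.0), cell_m=0.5):
    """Map a position in metres to (row, col).
    ASSUMPTION: the ego vehicle is at the grid center and axes are aligned."""
    row = math.floor((x_m + extent_m[0] / 2) / cell_m)
    col = math.floor((y_m + extent_m[1] / 2) / cell_m)
    return row, col

rows, cols = bev_grid_shape()
print(rows, cols)  # 200 200
```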