RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View

Jinrang Jia; Guangqi Yi; Yifeng Shi

TL;DR

RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. It ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.

Abstract

Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, a multi-camera BEV solution for roadside perception is still lacking. This paper systematically analyzes the key challenges of multi-camera BEV perception in roadside scenarios compared with vehicle-side ones. These challenges include the diversity in camera poses, the uncertainty in camera numbers, the sparsity in perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.
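
As a rough illustration of how CamMask and ROIMask could gate the 2D-to-3D mapping, the following Python sketch is a toy under stated assumptions, not the authors' implementation. It assumes a boolean table `hits[c][g]` saying whether BEV cell `g` projects into camera `c`, one scalar feature per camera and cell, and a simple average as the fusion step. The names `combine_masks` and `fuse_cells`, the padded third camera slot, and the averaging are all hypothetical stand-ins.

```python
def combine_masks(hits, cam_mask, roi_mask):
    """Keep a (camera, cell) pair only if the cell projects into the camera,
    the camera slot is active (CamMask), and the cell is inside the ROI (ROIMask)."""
    return [
        [hit and cam_ok and roi_ok for hit, roi_ok in zip(row, roi_mask)]
        for row, cam_ok in zip(hits, cam_mask)
    ]


def fuse_cells(features, valid):
    """Average the per-camera features of each cell over its valid cameras.
    Cells with no valid camera get 0.0 (an assumption of this sketch)."""
    num_cams = len(valid)
    num_cells = len(valid[0])
    fused = []
    for g in range(num_cells):
        vals = [features[c][g] for c in range(num_cams) if valid[c][g]]
        fused.append(sum(vals) / len(vals) if vals else 0.0)
    return fused


# Three camera slots (the last one is padding) and four BEV cells.
hits = [
    [True, True, False, False],
    [False, True, True, False],
    [True, True, True, True],
]
cam_mask = [True, True, False]        # hypothetical: slot 2 has no real camera
roi_mask = [True, True, True, False]  # hypothetical: cell 3 lies outside the ROI
features = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 100.0, 100.0, 100.0],
]

valid = combine_masks(hits, cam_mask, roi_mask)
fused = fuse_cells(features, valid)
```

In this toy setup the padded camera never contributes and the out-of-ROI cell stays empty, which is the behavior the abstract attributes to the two masks: a variable number of cameras and a sparse perception region.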

Paper Structure (18 sections, 1 equation, 7 figures, 3 tables)

Introduction
Related Work
Roadside Camera-based Perception
Multi-Camera BEV Perception
Method
Overall
Diversity in Camera Poses
Uncertainty in Camera Numbers
Sparsity in Perception Regions
Ambiguity in Orientation Angles
Experiments
Datasets
Implementation Details
Main Results
Ablation Study
...and 3 more sections

Figures (7)

Figure 1: An overview of the RopeBEV framework. The overall network follows a typical dense BEV architecture, which includes a backbone, a 2D-to-3D transformer, a temporal fusion module, and several task-specific heads. Considering the characteristics of roadside scenarios, RopeBEV introduces improvements in the 2D-to-3D transformer. The 2D-to-3D transformer can be divided into three stages: (1) Generate BEV Coordinate. In this stage, RopeBEV introduces BEV coordinate system data augmentation to address the training imbalance caused by the diversity of roadside camera poses. (2) Generate 2D-3D Mapping. Here, RopeBEV incorporates CamMask and ROIMask mechanisms to support customizable camera numbers and perception regions. (3) Generate BEV Feature. In this stage, RopeBEV integrates Camera Rotation Embedding into the features of single cameras to resolve orientation angle ambiguities.
Figure 2: Camera Views on vehicle-side and different roadside scenes. (a) illustrates the Camera View in a vehicle-side scenario. Regardless of where the vehicle travels, this view remains unchanged. Although Grid $P$ is never trained, it is also never utilized. (b) and (c) depict Camera Views from two different roadside scenarios. Due to the variability of real-world scenes, the Camera Views differ, leading to an imbalance in training. For instance, Grid $Q$ is trained in (b) but not in (c), while Grid $K$ is not trained in either (b) or (c). However, both Grid $Q$ and $K$ might still be used in other scenes, which could result in performance issues due to insufficient training. (d) demonstrates the application of BEV coordinate system translation and rotation for data augmentation in scenario (c). This augmentation allows Grids $Q$ and $K$ to be trained, addressing the training imbalance and ensuring that these grids are better prepared for use in various roadside scenarios.
Figure 3: ROIs on vehicle-side and roadside scenes. White regions on images are ROIs. Since roadside cameras are stationary, their ROIs are also fixed. However, because vehicles are in motion (from $Q$ to $P$), their ROIs vary as the vehicle's position changes. This distinction enables the use of ROIMask in roadside scenarios to filter out irrelevant perception areas, a method that cannot be applied to vehicle-side cameras.
Figure 4: Ambiguity in Orientation Angles. The camera deployment schemes in (a), (b), (c), and (d) are identical. In (a) and (c), the obstacle is a vehicle occupying multiple grids, whereas in (b) and (d), the obstacle is a pedestrian occupying only a single grid. The BEV coordinate system in (a) and (b) is centered at Camera A, with the Y-axis pointing to the right. As shown in the left-side schematic, the orientation angle of the obstacle is $\pi$. In (c) and (d), the BEV coordinate system is centered at Camera B, with the Y-axis pointing downward, and the obstacle's orientation angle is $\frac{3\pi}{2}$. When the BEV coordinate system shifts from (a) to (c), the BEV feature of the vehicle remains unchanged, but the 3D position encoding changes, resulting in a change in the orientation angle without ambiguity. However, when the BEV coordinate system shifts from (b) to (d), both the BEV feature and the 3D position encoding of the pedestrian remain unchanged, yet the orientation angle changes, leading to ambiguity.
Figure 5: Camera deployment of the private dataset. Each intersection has 8 pinhole and 4 fisheye cameras.
...and 2 more figures
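
The BEV coordinate system translation and rotation described for Figure 2(d), and the orientation change described for Figure 4, can both be illustrated as a rigid change of frame. The sketch below is a minimal example under assumptions that are not taken from the paper: angles are counterclockwise-positive, the augmented frame is the original frame rotated by `theta` about a new origin `(tx, ty)`, and the function names and sampling ranges are hypothetical.

```python
import math
import random


def transform_to_augmented_frame(x, y, yaw, theta, tx, ty):
    """Express a BEV point and its heading in a frame that is rotated by
    theta and whose origin sits at (tx, ty) in the original frame."""
    dx, dy = x - tx, y - ty
    c, s = math.cos(theta), math.sin(theta)
    # Apply the inverse rotation R(-theta) to the offset from the new origin.
    new_x = c * dx + s * dy
    new_y = -s * dx + c * dy
    # The heading is measured against the new axes, so it shifts by -theta.
    new_yaw = (yaw - theta) % (2 * math.pi)
    return new_x, new_y, new_yaw


def sample_bev_augmentation(rng, max_shift=10.0):
    """Draw a random frame rotation and translation (ranges are assumptions)."""
    theta = rng.uniform(-math.pi, math.pi)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    return theta, tx, ty


# One obstacle at (1, 0) heading pi, seen from a frame rotated by -pi/2.
rotated = transform_to_augmented_frame(1.0, 0.0, math.pi, -math.pi / 2, 0.0, 0.0)

# A random augmentation, as might be drawn once per training sample.
theta, tx, ty = sample_bev_augmentation(random.Random(0))
augmented = transform_to_augmented_frame(1.0, 0.0, math.pi, theta, tx, ty)
```

Under this assumed convention, a frame rotation of -π/2 turns a heading of π into 3π/2, the same pair of values quoted in the Figure 4 caption. Because the obstacle's position moves to different grid cells as the frame is resampled, cells that a fixed camera layout would never populate can still receive training signal, which is the effect the Figure 2 caption describes.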
