SGV3D: Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Lei Yang, Xinyu Zhang, Jun Li, Li Wang, Chuang Zhang, Li Ju, Zhiwei Li, Yang Shen

TL;DR

SGV3D tackles scenario generalization for vision-based roadside 3D object detection by mitigating background overfitting and diversifying foreground instances. It introduces a Background-suppressed Module (BSM) that attenuates background features during the 2D-to-BEV projection, and a Semi-supervised Data Generation Pipeline (SSDG) that synthesizes diverse, well-labeled training data from unlabeled new-scene images within a multi-round self-training framework. On the heterologous (new-scene) DAIR-V2X-I benchmark, it improves over BEVHeight by +42.57% for vehicles, +5.87% for pedestrians, and +14.89% for cyclists; on the heterologous Rope3D benchmark, it gains 14.48% for cars and 12.41% for large vehicles. The work treats background suppression and foreground enrichment as complementary remedies for poor cross-scene generalization in roadside perception.

Abstract

Roadside perception can greatly increase the safety of autonomous vehicles by extending their perception ability beyond the visual range and addressing blind spots. However, current state-of-the-art vision-based roadside detection methods possess high accuracy on labeled scenes but have inferior performance on new scenes. This is because roadside cameras remain stationary after installation and can only collect data from a single scene, resulting in the algorithm overfitting these roadside backgrounds and camera poses. To address this issue, in this paper, we propose an innovative Scenario Generalization Framework for Vision-based Roadside 3D Object Detection, dubbed SGV3D. Specifically, we employ a Background-suppressed Module (BSM) to mitigate background overfitting in vision-centric pipelines by attenuating background features during the 2D to bird's-eye-view projection. Furthermore, by introducing the Semi-supervised Data Generation Pipeline (SSDG) using unlabeled images from new scenes, diverse instance foregrounds with varying camera poses are generated, addressing the risk of overfitting specific camera poses. We evaluate our method on two large-scale roadside benchmarks. Our method surpasses all previous methods by a significant margin in new scenes, including +42.57% for vehicle, +5.87% for pedestrian, and +14.89% for cyclist compared to BEVHeight on the DAIR-V2X-I heterologous benchmark. On the larger-scale Rope3D heterologous benchmark, we achieve notable gains of 14.48% for car and 12.41% for large vehicle. We aspire to contribute insights on the exploration of roadside perception techniques, emphasizing their capability for scenario generalization. The code will be available at https://github.com/yanglei18/SGV3D
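The abstract says only that the BSM attenuates background features during the 2D to bird's-eye-view projection. As a rough illustration of that idea, and not the authors' implementation, the sketch below gates an image feature map with a per-pixel foreground probability before any projection would take place. The function name `suppress_background`, the tensor shapes, and the use of a sigmoid over segmentation logits are all assumptions made for this example.

```python
import numpy as np


def suppress_background(image_feats, fg_logits):
    """Scale image features by a per-pixel foreground probability.

    image_feats: array of shape (C, H, W), image-view features (assumed layout).
    fg_logits:   array of shape (H, W), foreground logits from a hypothetical
                 segmentation branch.
    Returns an array of shape (C, H, W) in which background pixels are damped.
    """
    if image_feats.shape[1:] != fg_logits.shape:
        raise ValueError("spatial sizes of features and logits must match")
    # Sigmoid turns logits into a soft foreground mask in (0, 1).
    fg_prob = 1.0 / (1.0 + np.exp(-fg_logits))
    # Broadcast the (H, W) mask across the channel axis.
    return image_feats * fg_prob[None, :, :]


# Toy usage: one confident foreground pixel, one confident background pixel.
feats = np.ones((2, 2, 2))
logits = np.array([[20.0, -20.0], [0.0, 0.0]])
masked = suppress_background(feats, logits)
```

With a soft mask like this, features at background pixels shrink toward zero, so they contribute less to whatever BEV representation is built afterwards.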

Paper Structure

This paper contains 13 sections, 15 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: (a) The homogeneous setting entails utilizing images from the same roadside scenes for both training and testing. (b) The heterogeneous setting involves training with images from labeled scenes but testing in entirely new and diverse scenes. (c) Presently, vision-based roadside 3D object detection methods demonstrate high performance in the homogeneous setting but experience a significant accuracy decline in the heterogeneous setting. Our SGV3D surpasses all state-of-the-art methods by a substantial margin in the heterogeneous setting, highlighting robust scenario generalization.
  • Figure 2: Empirical analysis of the poor scenario generalization of existing methods. (a) An overview of previous vision-centric roadside 3D object detectors. In particular, background regions constitute the majority of BEV features. (b) A pixel-level scatter plot of predicted versus ground-truth distance for BEVHeight; heights are converted to distances for a more intuitive comparison. The distance errors for background regions (marked 'BG') are significantly larger in new scenes than in labeled scenes. The distance errors for foreground regions (marked 'FG') are also larger in new scenes than in labeled scenes.
  • Figure 3: The overall framework of SGV3D. The Background-suppressed BEV Detector, which includes the Background-suppressed Module (BSM), is derived from BEVHeight [yang2023bevheight]. The BSM reduces overfitting to the backgrounds of labeled scenes by suppressing background features with a semantic segmentation branch. Meanwhile, the Semi-supervised Data Generation Pipeline (SSDG) generates diverse, well-labeled images under different camera poses for the training stage, minimizing the risk of the detector overfitting to specific camera settings, including intrinsic and extrinsic parameters. The framework performs multiple rounds of self-training, following STAC [sohn2020simple].
  • Figure 4: Visualization of diverse foreground instances. The same car, when captured by cameras with different intrinsic and extrinsic parameters, exhibits markedly different shapes and sizes.
  • Figure 5: Visualization of the source data and its rectified data. $O_iXYZ$ denotes the original camera coordinate system. $O_{bg}XYZ$ denotes the rectified camera coordinate system, which is consistent with that of the background data.
  • ...and 3 more figures
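
The Figure 2 caption mentions converting height to distance but does not spell out the conversion. One way to do it, shown here as a minimal sketch from basic ray geometry and not necessarily the authors' exact procedure, is to intersect the pixel's viewing ray with the horizontal plane at the predicted height. The function name `height_to_distance` and the ground-aligned frame (origin at the camera, z pointing up) are assumptions made for this example.

```python
import math


def height_to_distance(cam_height, point_height, ray_dir):
    """Distance from the camera to a point of known height along a pixel ray.

    cam_height:   camera height above the ground plane, in metres.
    point_height: height of the observed point above the ground, in metres.
    ray_dir:      (x, y, z) direction of the pixel ray in a ground-aligned
                  frame centred at the camera, with z pointing up (assumed).
    """
    dx, dy, dz = ray_dir
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        raise ValueError("ray_dir must be non-zero")
    unit_dz = dz / norm
    if unit_dz >= 0.0 or point_height >= cam_height:
        raise ValueError("sketch only handles downward rays to points below the camera")
    # Travelling t metres along the unit ray changes height by t * unit_dz,
    # so solve cam_height + t * unit_dz = point_height for t.
    return (point_height - cam_height) / unit_dz


# Camera 6 m up, point 1 m up, ray descending 3 m for every 4 m forward.
distance = height_to_distance(6.0, 1.0, (0.0, 4.0, -3.0))
```

In this toy case the ray must drop 5 m, which takes 25/3 ≈ 8.33 m of travel along the ray.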