MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues

Xiahan Chen; Mingjian Chen; Sanli Tang; Yi Niu; Jiang Zhu

TL;DR

MOSE tackles monocular roadside 3D object detection by exploiting frame-invariant scene cues derived from the road-ground relationship. It introduces a scene cue bank that aggregates cues across frames and fuses them in a deformable DETR-based 3D head with camera-parameter-aware position embeddings, enabling robust height-based lifting via the relative height $h_r$ and depth cues $d(u,v)$. The method achieves state-of-the-art results on Rope3D and DAIR-V2X-I, generalizes well to heterologous scenes that differ in intrinsic/extrinsic parameters and viewpoint, and improves long-distance localization. By emphasizing scene-specific, frame-invariant features and a data augmentation strategy for camera parameters, MOSE is of practical value for infrastructure-based perception in autonomous systems.

Abstract

3D object detection based on roadside cameras is an additional way for autonomous driving to alleviate the occlusion and short perception range of vehicle cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the fact that the cameras are stationary and the resulting inter-frame consistency. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. The scene cues are frame-invariant, scene-specific features that are crucial for object localization and can be intuitively regarded as the height between the surface of the real road and the virtual ground plane. In the proposed framework, a scene cue bank is designed to aggregate scene cues from multiple frames of the same scene with a carefully designed extrinsic augmentation strategy. Then, a transformer-based decoder lifts the aggregated scene cues together with the 3D position embeddings for 3D object localization, which boosts generalization in heterologous scenes. Extensive experimental results on two public benchmarks demonstrate the state-of-the-art performance of the proposed method, which surpasses existing methods by a large margin.

Paper Structure (23 sections, 10 equations, 6 figures, 10 tables, 2 algorithms)

Introduction
Related Works
Method
Preliminary
Detection from Roadside Camera
Framework
Optimization
Experiment
Dataset
Implementation Details
Main Results
Results on the original benchmark
Analysis on heterologous scenes
Analysis on detecting ability
Ablation Study
...and 8 more sections

Figures (6)

Figure 1: Examples of roadside datasets. (a) Simulation of the roadside camera characteristic. (b) The relative height of all objects on the ground in a roadside scene, which is always smooth and unaltered in the same location. (c) Location distribution of the cars' bottom centers in the ground coordinate.
Figure 2: Overview framework of our method. A 2D detector is equipped to acquire objects' 2D position information in the current input frame. Then a scene cue bank is proposed to aggregate and decouple object features from image features by 2D object proposals. Finally, a 3D head based on deformable transformer decoders is adopted to infer 3D object bounding boxes.
Figure 3: (a) Height-based lifting method. (b) Sensitivity of height error relative to distance. The x-axis and y-axis show the distance and the error in x in the ground coordinate for a camera height $H$ of 7 m, and $\delta h_r$ is the prediction error of $h_r$. A small error in $h_r$ leads to a large distance error, especially for long-distance objects.
Figure 4: The ratio of detected objects to the number of ground-truth (GT) objects at different distance thresholds. The distance is the Euclidean distance between a ground-truth object and its nearest predicted object.
Figure 5: Qualitative comparisons on the Rope3D validation set in the heterologous setting. In the BEV visualization, GT boxes are black; in the camera image, GT boxes are white, and other colors mark the detection results of different categories.
...and 1 more figures
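
The sensitivity described in the caption of Figure 3(b) can be illustrated with a small numeric sketch. It is not the paper's implementation: it assumes a simplified side-view pinhole model in which a camera at height $H$ sees a point at relative height $h_r$ along a fixed ray, so the lifted ground distance is $x = (H - h_r)/\tan\theta$ for a ray depression angle $\theta$. Under that assumption, an error $\delta h_r$ shifts the distance by about $x\,\delta h_r/(H - h_r)$, which grows linearly with $x$. The function names and the sample values other than $H = 7$ m are hypothetical.

```python
def lifted_distance(cam_height, rel_height, tan_depression):
    """Ground distance at which a viewing ray reaches the given height.

    Assumed side-view model: the ray leaves the camera at `cam_height`
    and descends with slope `tan_depression` (tangent of the depression angle).
    """
    return (cam_height - rel_height) / tan_depression


def distance_error(cam_height, true_distance, rel_height, delta_h):
    """Signed distance error caused by mispredicting the relative height by `delta_h`."""
    # The ray through the true point is fixed by the pixel, so recover its slope.
    tan_depression = (cam_height - rel_height) / true_distance
    predicted = lifted_distance(cam_height, rel_height + delta_h, tan_depression)
    return predicted - true_distance


if __name__ == "__main__":
    H = 7.0          # camera height in metres, as in the Figure 3 caption
    delta_h = 0.1    # hypothetical 10 cm error in the relative height
    for x in (20.0, 50.0, 100.0, 150.0):
        err = distance_error(H, x, 0.0, delta_h)
        print(f"distance {x:6.1f} m -> lifting error {err:+.2f} m")
```

In this simplified model the error magnitude scales with the true distance, which is consistent with the caption's remark that long-distance objects are affected most.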
