IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain

Zhe Wang, Xiaoliang Huo, Siqi Fan, Jingjing Liu, Ya-Qin Zhang, Yan Wang

TL;DR

This work tackles the challenge of roadside monocular 3D object detection, hindered by a view-domain gap between vehicle-side and roadside imagery. It introduces IROAM, a semantic-geometry decoupled contrastive learning framework that jointly processes roadside and vehicle-side data using a DETR-based detector with two modules: In-Domain Query Interaction and Cross-Domain Query Enhancement. By decoupling queries into semantic and geometry parts and applying contrastive learning solely to the semantic component, IROAM leverages abundant vehicle-side data to improve roadside detection, as demonstrated by significant gains across multiple datasets and data-balancing scenarios. The approach reduces the reliance on extensive roadside data, improves cross-domain generalization, and establishes a pathway for scalable, real-world deployment of cross-view monocular 3D perception systems.

Abstract

In autonomous driving, the perception capabilities of the ego-vehicle can be improved with roadside sensors, which can provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework that takes vehicle-side and roadside data as input simultaneously. IROAM has two main modules. The In-Domain Query Interaction module uses a transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometry parts, and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the roadside detector's performance. The results validate that IROAM is capable of learning cross-domain information.
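The idea of applying contrastive learning only to the semantic part of each query can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the channel split in `split_query`, the InfoNCE-style form of the loss, the row-wise pairing of roadside and vehicle-side positives, and the names `sem_dim` and `tau` are all hypothetical.

```python
import numpy as np

def split_query(q, sem_dim):
    # Hypothetical decoupling: the first sem_dim channels are treated as the
    # semantic part and the remaining channels as the geometry part.
    return q[..., :sem_dim], q[..., sem_dim:]

def l2_normalize(x, eps=1e-8):
    # Unit-normalize along the channel axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def semantic_contrastive_loss(q_road, q_veh, q_neg, sem_dim, tau=0.1):
    """InfoNCE-style loss computed on semantic parts only (illustrative).

    q_road: (N, D) positive roadside queries
    q_veh:  (N, D) positive vehicle-side queries; row i is assumed to be the
            positive partner of row i in q_road
    q_neg:  (M, D) negative queries
    """
    s_r, _ = split_query(q_road, sem_dim)   # geometry parts are discarded here
    s_v, _ = split_query(q_veh, sem_dim)
    s_n, _ = split_query(q_neg, sem_dim)
    s_r, s_v, s_n = l2_normalize(s_r), l2_normalize(s_v), l2_normalize(s_n)

    pos = np.sum(s_r * s_v, axis=-1, keepdims=True) / tau   # (N, 1)
    neg = (s_r @ s_n.T) / tau                               # (N, M)
    logits = np.concatenate([pos, neg], axis=1)             # positive in column 0
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
q_r = rng.normal(size=(4, 8))
q_v = rng.normal(size=(4, 8))
q_n = rng.normal(size=(6, 8))
loss = semantic_contrastive_loss(q_r, q_v, q_n, sem_dim=5)
print(f"semantic-only contrastive loss: {loss:.4f}")
```

Because the loss reads only the semantic channels, changing the geometry channels of any query leaves it unchanged. This is the property the decoupling is meant to provide when depth distributions differ between the two domains.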

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Vehicle-side and roadside data have view domain gaps. The same vehicle captured in vehicle-side and roadside images has similar semantic content with slight differences in viewpoint. However, the geometry distributions of objects (such as the depth distribution) vary significantly between the two data domains.
  • Figure 2: The framework of IROAM contains a roadside branch and a vehicle-side branch, each with the same architecture of Feature Encoder and In-Domain Query Interaction module. $\ast \in \{r,v\}$ indicates whether a variable belongs to the roadside or vehicle-side branch. For each branch, multi-scale image features $F_{C,\ast}$ are obtained from the input image $I_{\ast}$ and transformed into a content feature $f_{C,\ast}$ and a depth feature $f_{D,\ast}$. The DepthNet predicts a foreground depth map $D_{fg,\ast}$, which is supervised with ground-truth depth. $Q_{\ast}$ adaptively aggregates features from the content embeddings $f^{e}_{C,\ast}$ and depth embeddings $f^{e}_{D,\ast}$ and is updated as $Q^{d}_{\ast}$. It is then transformed into a positive sample set $Q^{P}_{\ast}$ and a negative sample set $Q^{N}_{\ast}$ by the Query Sampler. Finally, each object query is decoupled into semantic and geometry parts, and only the former is used for contrastive learning.
  • Figure 3: The procedure of semantic-geometry decoupled contrastive learning. Every query is disentangled into a semantic feature (colored) and a geometry feature (grey), and only the semantic features are used to calculate $l_{cl}$.
  • Figure 4: Visualization results of IROAM. From BEV, it is clear that the predicted bounding boxes (red) and labels (green) from IROAM are better aligned than those of the Only-inf and Addon methods.
  • Figure 5: Analysis of different ratios of vehicle-side to roadside data. Four groups of experiments (distinguished by different icons) simulate a significant imbalance between roadside and vehicle-side data, selecting $N_{r} = 2.2/3.3/4.4/8.8\text{K}$ roadside images respectively. Since Addon uses all vehicle-side data in one epoch, the ratio of vehicle-side data to roadside data $r$ is 4.23/2.82/2.12/1.06. IROAM samples a subset of the vehicle-side images to form pairs with roadside images for training. Only-Road repeats roadside data twice within one epoch for a fair comparison so that $r$ is always 1.0. The y-axis shows $AP_{3D}(IoU = 0.7)$ for the Moderate setting and the x-axis shows the number of images used in one training epoch.
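The Addon ratios quoted in the Figure 5 caption can be sanity-checked with a few lines of arithmetic. The sketch below assumes that $r$ is simply the vehicle-side image count divided by $N_{r}$, and back-computes the implied vehicle-side count for each group. That count, roughly 9.3K in every group, is derived here and is not stated in the caption.

```python
# Values copied from the Figure 5 caption (roadside image counts in thousands).
n_road_k = [2.2, 3.3, 4.4, 8.8]
ratio_addon = [4.23, 2.82, 2.12, 1.06]

# Assumption: r = N_v / N_r, so the implied vehicle-side count is N_v = r * N_r.
implied_n_veh_k = [r * n for r, n in zip(ratio_addon, n_road_k)]

for n, r, nv in zip(n_road_k, ratio_addon, implied_n_veh_k):
    print(f"N_r = {n:.1f}K, r = {r:.2f} -> implied N_v = {nv:.2f}K")
```

All four groups give nearly the same implied count, which is consistent with the caption's statement that Addon uses all vehicle-side data in every epoch while only the roadside subset changes.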