S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality

Jinlong Li; Runsheng Xu; Xinyu Liu; Baolu Li; Qin Zou; Jiaqi Ma; Hongkai Yu

TL;DR

This work tackles the domain gap that arises when multi-agent cooperative perception models trained in simulation are deployed in the real world, and it identifies two sources of that gap: the Deployment Gap and the Feature Gap. It introduces S2R-ViT, a Simulation-to-Reality transformer framework with two main components: S2R-UViT, an uncertainty-aware vision transformer that mitigates deployment-related uncertainty, and S2R-AFA, an agent-based feature adaptation module that reduces the Feature Gap using inter-agent and ego-agent discriminators. On OPV2V and V2V4Real, the approach outperforms the compared methods for point cloud-based 3D object detection, both under simulated deployment noise (GPS errors and communication latency) and in the simulation-to-reality setting. The framework offers a practical path toward robust real-world deployment of cooperative perception systems.

Abstract

Because real multi-agent data is scarce and labeling it is time-consuming, existing multi-agent cooperative perception algorithms usually rely on simulated sensor data for training and validation. However, perception performance degrades when these simulation-trained models are deployed in the real world, owing to the significant domain gap between simulated and real data. In this paper, we propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named S2R-ViT, which considers both the Deployment Gap and the Feature Gap between simulated and real data. We investigate the effects of these two types of domain gap and propose a novel uncertainty-aware vision transformer to effectively relieve the Deployment Gap, together with an agent-based feature adaptation module with inter-agent and ego-agent discriminators to reduce the Feature Gap. Our extensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality and significantly outperform other methods for point cloud-based 3D object detection.
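As a rough illustration of the adversarial idea behind the feature adaptation module, the sketch below computes a binary cross-entropy domain loss for two discriminators, one applied to features shared between agents and one applied to the ego agent's fused feature. Everything in it is an assumption made for illustration: the function names, the 0/1 domain labels, and the toy probabilities are hypothetical, and the paper's actual losses and network architecture are not reproduced here.

```python
import math

# Assumed domain labels for this sketch: simulation = 0, real = 1.
SIM, REAL = 0.0, 1.0


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability of 'real'."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def domain_loss(sim_probs, real_probs):
    """Mean BCE of one discriminator over simulated and real samples."""
    losses = [bce(p, SIM) for p in sim_probs]
    losses += [bce(p, REAL) for p in real_probs]
    return sum(losses) / len(losses)


def adaptation_loss(inter_sim, inter_real, ego_sim, ego_real):
    """Sum of the (hypothetical) inter-agent and ego-agent discriminator losses."""
    return domain_loss(inter_sim, inter_real) + domain_loss(ego_sim, ego_real)


if __name__ == "__main__":
    # Toy discriminator outputs: probability that a feature map is 'real'.
    total = adaptation_loss(
        inter_sim=[0.2, 0.3], inter_real=[0.7, 0.9],
        ego_sim=[0.4], ego_real=[0.6],
    )
    print(f"toy adaptation loss: {total:.3f}")
```

In adversarial adaptation generally, the discriminators are trained to lower this loss while the feature extractor is trained to raise it, so that simulated and real features become hard to tell apart. A discriminator that outputs 0.5 for every input sits at a loss of ln 2 per discriminator.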

Paper Structure (13 sections, 6 equations, 5 figures, 2 tables)

Introduction
Related Work
Methodology
Overview of Architecture
S2R-UViT: Simulation-to-Reality Uncertainty-aware Vision Transformer
Local-and-Global Multi-head Self Attention (LG-MSA)
Uncertainty-Aware Module (UAM)
S2R-AFA: Simulation-to-Reality Agent-based Feature Adaptation
Experiment
Dataset
Experiments Setup
Quantitative Evaluation
Conclusions

Figures (5)

Figure 1: Illustration of the domain gap (Deployment Gap, Feature Gap) for multi-agent cooperative perception from simulation to reality. Here we use Vehicle-to-Vehicle (V2V) cooperative perception in autonomous driving as an example. CAV denotes Connected Autonomous Vehicle.
Figure 2: Overview of the proposed S2R-ViT for multi-agent cooperative perception from simulation to reality, which leverages the S2R-UViT module to handle uncertainty in the Deployment Gap and tackles the Feature Gap through the S2R-AFA module, including the inter-agent discriminator $D_i$ and the ego-agent discriminator $D_e$. Source Domain: labeled simulated data. Target Domain: unlabeled real-world data. Best viewed in color.
Figure 3: Architecture of the proposed S2R-UViT: Simulation-to-Reality Uncertainty-aware Vision Transformer.
Figure 4: Robustness in the Deployment-Gap Scenario, including GPS (positional, heading) errors and communication latency, on the CARLA Towns testing set of OPV2V (Xu et al., 2022) with the simulated Noisy Setting.
Figure 5: Visualization example of point cloud-based 3D object detection on V2V4Real testing set under the Sim2Real Scenario. Green and red 3D bounding boxes represent ground truth and prediction respectively. Best viewed in color.
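
The noisy setting referenced in Figure 4 can be pictured with a small sketch that perturbs a shared agent pose with Gaussian positional and heading errors and converts a communication delay into a number of stale frames. The function names, the noise standard deviations, and the 100 ms frame period are assumptions chosen for illustration, not values taken from the paper.

```python
import math
import random


def perturb_pose(x, y, yaw_deg, pos_std=0.2, yaw_std=0.2, rng=None):
    """Add zero-mean Gaussian noise to a 2D pose (meters, degrees).

    pos_std and yaw_std are hypothetical defaults for this sketch.
    """
    rng = rng or random.Random()
    noisy_x = x + rng.gauss(0.0, pos_std)
    noisy_y = y + rng.gauss(0.0, pos_std)
    noisy_yaw = yaw_deg + rng.gauss(0.0, yaw_std)
    return noisy_x, noisy_y, noisy_yaw


def stale_frames(latency_ms, frame_period_ms=100.0):
    """Number of whole frames a delayed message lags behind the ego agent.

    frame_period_ms=100 assumes a 10 Hz sensor; adjust for other rates.
    """
    return math.ceil(latency_ms / frame_period_ms)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    print(perturb_pose(10.0, -3.0, 45.0, rng=rng))
    print(stale_frames(250.0))
```

Passing a seeded `random.Random` keeps a robustness sweep repeatable, and setting both standard deviations to zero recovers the noise-free pose.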
