Leveraging Temporal Contexts to Enhance Vehicle-Infrastructure Cooperative Perception

Jiaru Zhong, Haibao Yu, Tianyi Zhu, Jiahui Xu, Wenxian Yang, Zaiqing Nie, Chao Sun

TL;DR

CTCE tackles cooperative perception by exploiting temporal contexts in a camera-based VIC3D setup. It introduces a multi-level temporal contexts integration mechanism that combines roadside temporal context aggregation (TCA) with ego-vehicle temporal-guided fusion (TGF), plus a motion-aware reconstruction (MAR) module to cope with roadside data lost during communication interruptions. The method transmits the top-$m$ roadside queries via V2X, fuses them with ego-vehicle queries, and reconstructs lost roadside data from historical trajectories. It achieves $mAP$ gains of 3.8 percentage points on V2X-Seq and 1.3 percentage points on V2X-Sim over QUEST, and remains robust across varying $PDR$. These results suggest that temporal cooperative perception is practical under realistic communication conditions in intelligent transportation systems (ITS).
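The query selection step mentioned above can be illustrated with a minimal sketch. It assumes each roadside query carries a feature vector and a scalar confidence score; the `Query` class, its field names, and `select_top_m` are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    # Hypothetical container: a feature embedding plus a confidence score.
    feature: List[float]
    score: float


def select_top_m(queries: List[Query], m: int) -> List[Query]:
    """Return the m highest-confidence queries, highest score first."""
    if m <= 0:
        return []
    # sorted() is stable, so queries with equal scores keep their input order.
    ranked = sorted(queries, key=lambda q: q.score, reverse=True)
    return ranked[:m]


if __name__ == "__main__":
    roadside = [Query([0.1], 0.2), Query([0.2], 0.9), Query([0.3], 0.5)]
    for q in select_top_m(roadside, 2):
        print(q.score)
```

Sending only the selected queries, rather than full feature maps, is what keeps the transmitted payload small; how $m$ is chosen is not specified in this summary.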

Abstract

Infrastructure sensors installed at elevated positions offer a broader perception range and encounter fewer occlusions. Integrating both infrastructure and ego-vehicle data through V2X communication, known as vehicle-infrastructure cooperation, has shown considerable advantages in enhancing perception capabilities and addressing corner cases encountered in single-vehicle autonomous driving. However, cooperative perception still faces numerous challenges, including limited communication bandwidth and practical communication interruptions. In this paper, we propose CTCE, a novel framework for cooperative 3D object detection. This framework transmits queries with temporal contexts enhancement, effectively balancing transmission efficiency and performance to accommodate real-world communication conditions. Additionally, we propose a temporal-guided fusion module to further improve performance. The roadside temporal enhancement and vehicle-side spatial-temporal fusion together constitute a multi-level temporal contexts integration mechanism, fully leveraging temporal information to enhance performance. Furthermore, a motion-aware reconstruction module is introduced to recover lost roadside queries due to communication interruptions. Experimental results on V2X-Seq and V2X-Sim datasets demonstrate that CTCE outperforms the baseline QUEST, achieving improvements of 3.8% and 1.3% in mAP, respectively. Experiments under communication interruption conditions validate CTCE's robustness to communication interruptions.

Paper Structure (25 sections, 6 equations, 7 figures, 3 tables)

This paper contains 25 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The Comparison of Different Cooperation Methods. In contrast to single-frame spatial cooperation, our proposed multi-frame spatial-temporal cooperation utilizes temporal contexts in two ways: extracting temporal features from multiple roadside frames and performing spatial-temporal fusion with the roadside historical sequence.
  • Figure 2: The Illustration of Communication Interruption. Red and green boxes denote detection and ground truth results respectively. Compared to ideal communication, interruption can cause transmission loss, which harms cooperative detection.
  • Figure 3: Overview of the Proposed CTCE. At the infrastructure side: i) roadside queries are generated from the image and interact with historical queries to obtain temporal queries; ii) the temporal queries are filtered by confidence and transmitted to the ego-vehicle through V2X communication. At the ego-vehicle side: i) ego-vehicle queries are extracted from the ego-vehicle image; ii) a novel spatial-temporal fusion module fuses ego-vehicle queries, roadside queries, and stored roadside historical queries; iii) the fused queries are input to the detection head to generate the cooperative perception results; iv) a motion-aware reconstruction module recovers roadside queries lost to communication interruptions, ensuring robustness.
  • Figure 4: Temporal-Guided Fusion Module. This module first fuses roadside and ego-vehicle queries, and then the coarse fused queries interact with roadside historical queries to refine the representation by temporal contexts.
  • Figure 5: The Pipeline of the Motion-Aware Reconstruction. It first tracks queries from historical frames and then predicts the lost queries. This simple yet effective module helps gain robustness to communication interruptions.
  • ...and 2 more figures
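
The track-then-predict idea in the Figure 5 caption can be sketched as follows. This is only an illustration under a strong assumption: it uses constant-velocity extrapolation from the last two tracked positions as a stand-in for the paper's motion-aware reconstruction, whose actual formulation is not given in this summary. The function name `reconstruct_lost`, the `history` layout, and the `steps` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def reconstruct_lost(
    history: Dict[int, List[Point]], steps: float = 1.0
) -> Dict[int, Point]:
    """Predict the current position of each tracked roadside object.

    history maps a track id to its past (x, y) positions, oldest first.
    steps is the number of frame intervals to extrapolate forward.
    Assumption: constant velocity between the last two observed frames.
    """
    predicted: Dict[int, Point] = {}
    for track_id, positions in history.items():
        if not positions:
            continue  # nothing to extrapolate from
        if len(positions) == 1:
            # A single observation gives no velocity, so reuse it as-is.
            predicted[track_id] = positions[-1]
            continue
        (x0, y0), (x1, y1) = positions[-2], positions[-1]
        predicted[track_id] = (x1 + (x1 - x0) * steps, y1 + (y1 - y0) * steps)
    return predicted


if __name__ == "__main__":
    past = {7: [(0.0, 0.0), (1.0, 0.5)], 9: [(4.0, 4.0)]}
    print(reconstruct_lost(past))
```

In a setup like the one described, the ego-vehicle would call such a routine only for frames where the roadside message did not arrive, substituting the predictions for the missing roadside queries before fusion.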