StreamLTS: Query-based Temporal-Spatial LiDAR Fusion for Cooperative Object Detection

Yunshuang Yuan, Monika Sester

TL;DR

The paper tackles cooperative object detection under asynchronous LiDAR ticking times across multiple intelligent agents. It introduces Time-Aligned Cooperative Object Detection (TA-COOD) and a fully sparse, query-based temporal-spatial fusion framework named StreamLTS that uses per-object timestamps and a memory-augmented transformer to fuse data before sharing. It also adapts prominent datasets to reflect sub-frame asynchrony, creating OPV2Vt and DairV2Xt, and demonstrates superior accuracy and efficiency against state-of-the-art methods, especially at higher IoU thresholds. The findings show that modeling precise point-wise observation timestamps improves temporal context and time-related location predictability, with practical implications for robust cooperative perception and real-time autonomy.

Abstract

Cooperative perception via communication among intelligent traffic agents has great potential to improve the safety of autonomous driving. However, limited communication bandwidth, localization errors, and asynchronous capture times of sensor data all complicate the fusion of data from different agents. To some extent, previous works have attempted to reduce the shared data size and to mitigate the spatial feature misalignment caused by localization errors and communication delay. However, none of them has considered asynchronous sensor ticking times, which can lead to dynamic object misplacement of more than one meter during data fusion. In this work, we propose Time-Aligned COoperative Object Detection (TA-COOD), for which we adapt the widely used datasets OPV2V and DairV2X to account for asynchronous LiDAR sensor ticking times, and build an efficient fully sparse framework that models the temporal information of individual objects with query-based techniques. The experimental results confirm the superior efficiency of our fully sparse framework compared to state-of-the-art dense models. More importantly, they show that the point-wise observation timestamps of dynamic objects are crucial for accurately modeling the temporal context of objects and the predictability of their time-related locations. The official code is available at https://github.com/YuanYunshuang/CoSense3D.
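To give a feel for the "more than one meter" figure, the short sketch below computes how far an object moves during a timestamp offset between two sensors. The helper name `misplacement_m`, the 10 Hz sweep rate (0.1 s offset), and the example speeds are illustrative assumptions, not values taken from the paper.

```python
def misplacement_m(speed_kmh: float, time_offset_s: float) -> float:
    """Distance in meters an object travels during a timestamp offset.

    Assumes constant speed over the offset (an illustrative simplification).
    """
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms * time_offset_s


if __name__ == "__main__":
    # Assumed worst case: two observations of the same object that are
    # one full sweep apart for a 10 Hz LiDAR (0.1 s).
    assumed_offset_s = 0.1
    for speed_kmh in (30.0, 50.0, 100.0):
        d = misplacement_m(speed_kmh, assumed_offset_s)
        print(f"{speed_kmh:.0f} km/h over {assumed_offset_s} s: {d:.2f} m")
```

Under these assumptions, any object faster than 36 km/h is displaced by more than one meter between the two observations, which is why the offset cannot be ignored for dynamic objects.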

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 3 tables, and 2 algorithms.

Figures (7)

  • Figure 1: TA-COOD: The scan points (blue to red scaled colors) and objects are observed at different timestamps, leading to unaligned object ground-truth (yellow boxes). The bounding boxes at the scan end of this frame are taken as the final ground-truth for TA-COOD (green boxes).
  • Figure 2: TA-COOD with two CAVs. Dashed boxes: ground-truth BBox of ego- and coop-CAV observations. Solid box: time-aligned ground-truth BBox.
  • Figure 3: Left: Overall framework for TA-COOD. Local Data Process encodes local point cloud features and fuses the temporal context into the object queries; weights are shared across agents. Data Fusion Process fuses the spatial object query features from the $N$ intelligent agents (IAs). Right: MTA alignment. Motion-Time-Aware embeds the query positions ($Q_p$), and Motion-Aware embeds the query context ($Q_c$) and initial queries ($tgt$).
  • Figure 4: The details of the TempFusion and SpatialFusion modules.
  • Figure 5: TA-COOD result of StreamLTS. Yellow BBoxes: ego-view annotation. Green BBoxes: time-aligned annotation. Red BBoxes: time-aligned detection.
  • ...and 2 more figures
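
The captions of Figures 1 and 2 describe moving per-observation boxes to a common reference time (the scan end of the frame). A minimal sketch of that idea is given below, assuming linear motion between two timestamped box centers. The `BoxState` type, the `align_to_time` helper, and the use of linear interpolation are hypothetical illustrations and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class BoxState:
    """Hypothetical container: box center (x, y) in meters at time t in seconds."""
    t: float
    x: float
    y: float


def align_to_time(a: BoxState, b: BoxState, t_ref: float) -> BoxState:
    """Linearly interpolate (or extrapolate) the box center to t_ref.

    Assumes constant velocity between the two states, which is a
    simplification chosen for illustration.
    """
    if b.t == a.t:
        raise ValueError("the two states must have different timestamps")
    w = (t_ref - a.t) / (b.t - a.t)  # 0 at state a, 1 at state b
    return BoxState(t=t_ref, x=a.x + w * (b.x - a.x), y=a.y + w * (b.y - a.y))


if __name__ == "__main__":
    # The object is seen at t = 0.03 s; the next state is known at t = 0.13 s.
    seen = BoxState(t=0.03, x=10.0, y=2.0)
    later = BoxState(t=0.13, x=11.5, y=2.0)
    # Assumed scan end of the current frame at t = 0.10 s.
    print(align_to_time(seen, later, t_ref=0.10))
```

The same helper could be applied to each agent's observation so that all boxes refer to one timestamp before they are compared or fused.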