SparseAlign: A Fully Sparse Framework for Cooperative Object Detection

Yunshuang Yuan, Yan Xia, Daniel Cremers, Monika Sester

TL;DR

SparseAlign addresses the bandwidth and scalability challenges of LiDAR-based cooperative object detection with a fully sparse framework. It combines a Sparse UNet backbone (SUNet) built from Coordinate-Expandable Convolutions (CECs), a query-based temporal module (TAM), pose alignment (PAM), and spatial fusion (SAM), together with CompassRose orientation encoding. It reports state-of-the-art performance on OPV2V and DairV2X while reducing inter-vehicle communication through compact, query-based CPM sharing, and it handles latency and sensor asynchrony robustly. Key contributions include the SUNet backbone, which mitigates the Center Feature Missing (CFM) and Isolated Convolution Field (ICF) issues; temporal-spatial alignment modules that enable effective cross-agent fusion; and a data augmentation (FSA) that improves long-range detection. In practice, the approach enables long-range cooperative perception at low bandwidth, improves robustness to localization errors, and performs strongly on time-aligned cooperative object detection (TA-COOD).

Abstract

Cooperative perception can increase the field of view and reduce the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both the OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with lower communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.

Paper Structure

This paper contains 19 sections, 6 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: a,b) Communication strategies for cooperative object detection. c-e) Issues of sparse convolutional backbone networks. c) Center Feature Missing (CFM). d,e) Isolated Convolution Field (ICF) caused by ring disconnectivity and occlusion.
  • Figure 2: Overview of the SparseAlign framework.
  • Figure 3: Receptive field (RF) after 4 sparse convolutions (convs). a) LiDAR points; b) Sparse pixels (white) with a conv center point (red); c) RF coverage of 4 normal sparse convs (red); d) RF coverage of 4 CECs (red + light blue; the darkest blue is background).
  • Figure 4: CompassRose encoding
  • Figure 5: AP at an IoU threshold of 0.7 with translation errors ranging from 0 m to 1 m along the x- and y-axes, and rotation errors from $0^\circ$ to $1.0^\circ$ (horizontal axis) for the different datasets.
  • ...and 6 more figures
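
The receptive-field contrast described in the Figure 3 caption, and the isolated fields from Figure 1, can be illustrated with a toy 2D sketch. It is not the paper's CEC implementation. It assumes that "normal" sparse convolutions keep their outputs on the already-active coordinates (submanifold-style), and that a coordinate-expanding convolution may also write to empty neighbouring cells. The function names, the 3x3 kernel, and the grid are all hypothetical.

```python
from itertools import product

# 3x3 kernel offsets (assumed kernel size for this toy example)
OFFSETS = list(product((-1, 0, 1), repeat=2))


def submanifold_rf(active, center, layers):
    """Active input pixels that can influence `center` after `layers` convs
    when outputs exist only at originally active pixels (assumption)."""
    reached = {center}
    for _ in range(layers):
        # Information can only hop between neighbouring *active* pixels.
        reached |= {
            (y + dy, x + dx)
            for (y, x) in reached
            for (dy, dx) in OFFSETS
            if (y + dy, x + dx) in active
        }
    return reached


def expanding_rf(active, center, layers):
    """Same question when every layer may also create outputs at empty
    neighbours (assumed expansion behaviour): empty cells can relay
    information, so any active pixel within `layers` steps is reachable."""
    cy, cx = center
    return {
        (y, x)
        for (y, x) in active
        if max(abs(y - cy), abs(x - cx)) <= layers
    }


# Two fragments of one scan ring, separated by a one-pixel gap at x = 2.
active = {(0, 0), (0, 1), (0, 3), (0, 4)}
center = (0, 1)

print(sorted(submanifold_rf(active, center, 4)))
print(sorted(expanding_rf(active, center, 4)))
```

Under these assumptions, the first call never crosses the gap and stays within the left fragment, no matter how many layers are stacked, while the second reaches both fragments.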