SparseCoop: Cooperative Perception with Kinematic-Grounded Queries

Jiahao Wang, Zhongwei Jiang, Wenchao Sun, Jiaru Zhong, Haibao Yu, Yuner Zhang, Chenyang Lu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang

TL;DR

SparseCoop presents a fully sparse cooperative perception framework that removes dense BEV features and relies on kinematic-grounded queries (KGQ) for precise spatio-temporal alignment across asynchronous viewpoints. The method introduces a coarse-to-fine aggregation pipeline and a cooperative instance denoising task to stabilize training and improve fusion among ego and cooperative agents. Across V2X-Seq and Griffin benchmarks, SparseCoop achieves state-of-the-art detection and tracking performance with significantly lower transmission costs and robust latency tolerance, demonstrating practical viability for real-world cooperative perception. Key innovations include the explicit 11D state in KGQ, a Geo-Appearance Matching strategy for cross-agent pairing, and an attention-based refinement framework grounded in ego-vehicle imagery.

Abstract

Cooperative perception is critical for autonomous driving, overcoming the inherent limitations of a single vehicle, such as occlusions and constrained fields-of-view. However, current approaches sharing dense Bird's-Eye-View (BEV) features are constrained by quadratically-scaling communication costs and the lack of flexibility and interpretability for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework features a trio of innovations: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module for robust fusion; and a cooperative instance denoising task to accelerate and stabilize training. Experiments on V2X-Seq and Griffin datasets show SparseCoop achieves state-of-the-art performance. Notably, it delivers this with superior computational efficiency, low transmission cost, and strong robustness to communication latency. Code is available at https://github.com/wang-jh18-SVM/SparseCoop.
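
The spatio-temporal alignment that the abstract attributes to the kinematic-grounded query can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the 11D layout `[x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz]`, the constant-velocity motion model, the planar (yaw-only) transform between agents, and the function name `align_state` are all hypothetical. The idea is that a state carrying explicit geometry and velocity can first be propagated across the communication delay and then moved into the ego frame with ordinary rigid-body arithmetic.

```python
import math

# Hypothetical 11D layout (an assumption, not taken from the paper):
# [x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz]


def align_state(state, dt, yaw_offset, translation):
    """Propagate a sender-frame state by dt seconds, then express it in the ego frame.

    state:       list of 11 floats in the assumed layout above.
    dt:          delay in seconds between the sender's timestamp and the ego's.
    yaw_offset:  rotation about the z axis (radians) from sender frame to ego frame.
    translation: (tx, ty, tz), the sender origin expressed in the ego frame.
    """
    x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz = state

    # Temporal alignment: constant-velocity motion compensation (assumed model).
    x, y, z = x + vx * dt, y + vy * dt, z + vz * dt

    # Spatial alignment: planar rotation followed by translation.
    cr, sr = math.cos(yaw_offset), math.sin(yaw_offset)
    tx, ty, tz = translation
    x_ego = cr * x - sr * y + tx
    y_ego = sr * x + cr * y + ty
    z_ego = z + tz

    # Heading and velocity are directions, so they rotate but do not translate.
    cos_ego = cr * cos_yaw - sr * sin_yaw
    sin_ego = sr * cos_yaw + cr * sin_yaw
    vx_ego = cr * vx - sr * vy
    vy_ego = sr * vx + cr * vy

    # Box size is frame-independent and is passed through unchanged.
    return [x_ego, y_ego, z_ego, w, l, h, sin_ego, cos_ego, vx_ego, vy_ego, vz]


if __name__ == "__main__":
    # An object 10 m ahead of the sender, moving at 5 m/s, received 0.2 s late.
    sender_state = [10.0, 0.0, 0.0, 2.0, 4.0, 1.5, 0.0, 1.0, 5.0, 0.0, 0.0]
    print(align_state(sender_state, 0.2, math.pi / 2, (1.0, 2.0, 0.0)))
```

Because every quantity in the state has a physical meaning, the alignment needs no learned warping of a feature map; a dense BEV feature, by contrast, has no explicit velocity to propagate across the delay.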

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Comparison of cooperative perception pipelines. Existing methods (a) are bottlenecked by intermediate dense BEV features, even when these features are selected and compressed (a.1) or encoded into sparse queries anchored to reference points (a.2). In contrast, SparseCoop (b) is a fully sparse paradigm that bypasses the BEV step, directly extracting queries grounded by rich state vectors from image features.
  • Figure 2: Performance comparison on V2X-Seq dataset. The X-axis and Y-axis represent perception metrics, while the bubble size and color encode the transmission cost on a logarithmic scale.
  • Figure 3: An overview of the SparseCoop framework. Each agent independently performs Sparse Instance Extraction. The ego-vehicle then uses the proposed Kinematic-Grounded Association and Coarse-to-Fine Aggregation modules to fuse transmitted instances with its own. Cooperative Instance Denoising (dashed lines) is only active during training to stabilize convergence.
  • Figure 4: Spatio-temporal alignment for KGQ state vectors.
  • Figure 5: Motivation for CID. (a) A significant portion of ground-truth objects are visible to only one agent, limiting opportunities for cooperative supervision. (b) Even when an object is visible to both agents, predictions for the same GT (2) can be too far apart to be matched, further reducing positive samples for the fusion module.
  • ...and 2 more figures