SparseCoop: Cooperative Perception with Kinematic-Grounded Queries
Jiahao Wang, Zhongwei Jiang, Wenchao Sun, Jiaru Zhong, Haibao Yu, Yuner Zhang, Chenyang Lu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang
TL;DR
SparseCoop presents a fully sparse cooperative perception framework that removes dense BEV features and relies on kinematic-grounded queries (KGQ) for precise spatio-temporal alignment across asynchronous viewpoints. The method introduces a coarse-to-fine aggregation pipeline and a cooperative instance denoising task to stabilize training and improve fusion among ego and cooperative agents. Across V2X-Seq and Griffin benchmarks, SparseCoop achieves state-of-the-art detection and tracking performance with significantly lower transmission costs and robust latency tolerance, demonstrating practical viability for real-world cooperative perception. Key innovations include the explicit 11D state in KGQ, a Geo-Appearance Matching strategy for cross-agent pairing, and an attention-based refinement framework grounded in ego-vehicle imagery.
Abstract
Cooperative perception is critical for autonomous driving, overcoming the inherent limitations of a single vehicle, such as occlusions and constrained fields of view. However, current approaches that share dense Bird's-Eye-View (BEV) features are constrained by quadratically scaling communication costs and a lack of the flexibility and interpretability needed for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework features a trio of innovations: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module for robust fusion; and a cooperative instance denoising task to accelerate and stabilize training. Experiments on the V2X-Seq and Griffin datasets show that SparseCoop achieves state-of-the-art performance. Notably, it does so with superior computational efficiency, low transmission cost, and strong robustness to communication latency. Code is available at https://github.com/wang-jh18-SVM/SparseCoop.
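
To make the idea of spatio-temporal alignment with an explicit state vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical 11-D layout `[x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz]`, a constant-velocity motion model over the communication latency, and a 4x4 rigid transform from the cooperative agent's frame to the ego frame whose rotation is about the vertical axis only. The function name `align_queries` and all parameter names are illustrative.

```python
import numpy as np

# Hypothetical 11-D state layout (an assumption, not taken from the paper):
# [x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz]
POS = slice(0, 3)
SIN_YAW, COS_YAW = 6, 7
VEL = slice(8, 11)


def align_queries(states, T_ego_from_coop, latency_s):
    """Compensate latency, then express cooperative states in the ego frame.

    states:           (N, 11) array in the cooperative agent's frame.
    T_ego_from_coop:  (4, 4) rigid transform, assumed to rotate about z only.
    latency_s:        communication delay in seconds.
    """
    states = np.asarray(states, dtype=float)
    T = np.asarray(T_ego_from_coop, dtype=float)
    R, t = T[:3, :3], T[:3, 3]
    out = states.copy()

    # Temporal alignment: constant-velocity propagation over the latency.
    pos = states[:, POS] + states[:, VEL] * latency_s

    # Spatial alignment: positions get rotation and translation,
    # velocities get rotation only.
    out[:, POS] = pos @ R.T + t
    out[:, VEL] = states[:, VEL] @ R.T

    # Heading: add the yaw offset implied by the planar part of R.
    yaw_offset = np.arctan2(R[1, 0], R[0, 0])
    yaw = np.arctan2(states[:, SIN_YAW], states[:, COS_YAW]) + yaw_offset
    out[:, SIN_YAW] = np.sin(yaw)
    out[:, COS_YAW] = np.cos(yaw)
    return out
```

Because each query carries its own position, heading, and velocity, alignment under this sketch reduces to a few array operations per instance and needs no resampling of a dense feature grid. Box dimensions (`w`, `l`, `h`) are frame-independent and pass through unchanged.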
