QUEST: Query Stream for Practical Cooperative Perception
Siqi Fan, Haibao Yu, Wenxian Yang, Jirui Yuan, Zaiqing Nie
TL;DR
The paper addresses occlusion and limited sensing range in autonomous driving by proposing query cooperation, a middle-ground paradigm between scene-level feature cooperation and instance-level result fusion. It introduces QUEST, a cross-agent query stream in which transformer object queries flow between agents and interact via fusion for co-aware objects and complementation for unseen ones, using a dual-space embedding and attentive fusion to align and merge queries. On DAIR-V2X-Seq, QUEST delivers substantial improvements over vehicle-only and traditional cooperation approaches, with $AP_{BEV|0.5}=20.3\%$ and $AP_{3D|0.5}=14.1\%$, and demonstrates transmission flexibility and robustness to packet dropout. The work highlights practical benefits for cross-agent perception, provides camera-centric cooperation labels, and outlines extensions toward temporal cooperation and end-to-end cooperative driving. It also notes deployment challenges, such as the need for query-based onboard systems and cross-architecture alignment. These findings suggest that query-based interaction can offer a scalable, interpretable pathway to robust cooperative perception in real-world settings.
Abstract
Cooperative perception can effectively enhance individual perception performance by providing additional viewpoints and expanding the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). In this paper, we propose the concept of query cooperation to enable interpretable, instance-level, flexible feature interaction. To make the concept concrete, we propose a cooperative perception framework, termed QUEST, which lets query streams flow among agents. Cross-agent queries interact via fusion for co-aware instances and complementation for instances that an individual agent is unaware of. Taking camera-based vehicle-infrastructure perception as a typical practical application scenario, experimental results on the real-world dataset DAIR-V2X-Seq demonstrate the effectiveness of QUEST and further reveal the advantages of the query cooperation paradigm in transmission flexibility and robustness to packet dropout. We hope our work can further facilitate cross-agent representation interaction for better cooperative perception in practice.
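
To illustrate the two interaction modes named in the abstract (fusion for co-aware instances, complementation for unaware ones), here is a minimal sketch. It is not the QUEST implementation. The `Query` class, the `cooperate` function, the `match_radius` threshold, nearest-reference-point matching, and feature averaging are all assumptions made for illustration; the paper itself aligns and merges queries with a dual-space embedding and attentive fusion. The sketch also assumes the reference points of both agents are already expressed in one shared coordinate frame.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple


@dataclass
class Query:
    """Hypothetical object query: a reference point plus a feature vector."""
    ref: Tuple[float, float]   # reference point, assumed to be in a shared frame
    feat: List[float]          # query feature vector


def cooperate(vehicle: List[Query], infra: List[Query],
              match_radius: float = 1.0) -> List[Query]:
    """Merge received infrastructure queries into the vehicle's query set.

    Assumption: two queries describe the same instance when their reference
    points lie within `match_radius` of each other.
    """
    # Copy the local queries so the caller's list is left untouched.
    merged = [Query(q.ref, list(q.feat)) for q in vehicle]
    n_local = len(merged)

    for r in infra:
        # Find the nearest local query (None when the vehicle has no queries).
        best = min(range(n_local),
                   key=lambda i: dist(merged[i].ref, r.ref),
                   default=None)
        if best is not None and dist(merged[best].ref, r.ref) <= match_radius:
            # Fusion: co-aware instance. Plain averaging stands in for the
            # attentive fusion described in the paper.
            q = merged[best]
            q.feat = [(a + b) / 2 for a, b in zip(q.feat, r.feat)]
        else:
            # Complementation: the vehicle has no query for this instance,
            # so the received query is appended as-is.
            merged.append(Query(r.ref, list(r.feat)))
    return merged


# Usage: one shared object near the origin, one object only the roadside sees.
vehicle = [Query((0.0, 0.0), [1.0, 0.0])]
infra = [Query((0.2, 0.0), [0.0, 1.0]), Query((30.0, 5.0), [0.5, 0.5])]
out = cooperate(vehicle, infra)
```

In this toy run the first infrastructure query falls within the matching radius and is fused into the vehicle's existing query, while the second is far from every local query and is added as a new one. Because the exchange unit is an individual query rather than a whole feature map, a lost query affects only one instance, which is one intuition for the packet-dropout robustness the abstract reports.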
