CoopDETR: A Unified Cooperative Perception Framework for 3D Detection via Object Query

Zhe Wang, Shaocong Xu, Xucai Zhuang, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang

TL;DR

CoopDETR tackles the bandwidth bottleneck in multi-agent cooperative perception by shifting from region- or raw-data fusion to object-level feature cooperation through object queries. It encodes each agent's observations into a set of $N_q$ object queries via a PointDETR-based single-agent module and then performs cross-agent fusion using Spatial Query Matching and Object Query Aggregation to fuse queries within object-specific graphs. Empirically, CoopDETR achieves state-of-the-art AP on OPV2V and V2XSet while reducing transmission volume to approximately 1/782 of prior intermediate-fusion methods, and demonstrates robustness to pose errors in communication. This approach highlights the practicality and effectiveness of object-centric collaboration for scalable, high-performance cooperative perception in autonomous systems.

Abstract

Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit region-level features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.

Paper Structure

This paper contains 13 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Consider a typical cooperative perception scenario involving three connected agents (1 to 3) and five objects (A to E) to be detected. Each agent processes its own point cloud and generates queries for surrounding objects using a DETR-based model. Queries corresponding to the same object in the scene can be connected to form an object query graph, which facilitates further query fusion via an attention mechanism. Subfigure (d) illustrates the object query graphs for objects A to E.
  • Figure 2: The general framework of CoopDETR. For each agent, the query generation module learns $N_q$ object queries from raw data, and each object in the scene corresponds to a query. Across the multi-agent system, one object may be observed by several agents and therefore be associated with several queries. Taking the $i$-th agent as the ego agent, the object queries $Q_{j} = \{q^{j}_{1},\dots,q^{j}_{N_q}\}$ from the $j$-th agent and their reference points $r$ are transmitted to the $i$-th agent. In the cross-agent query fusion module, all queries are fused in two steps. The first step associates the queries of co-aware objects through Spatial Query Matching (SQM) and generates an object query graph for each object. The second step fuses all queries in the same graph using Object Query Aggregation (OQA) and generates a set of updated queries $\hat{Q}$, which are fed to the detection heads for category and bounding box prediction.
  • Figure 3: Illustration of PointDETR module.
  • Figure 4: The details of Cross-Agent Query Fusion.
  • Figure 5: Cooperative perception performance comparison of CoopDETR and other methods on the V2XSet and OPV2V datasets. The communication volume of each method is also shown.
  • ...and 3 more figures
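
The two fusion steps described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how SQM matches queries or how OQA weights them, so the sketch assumes (a) queries are matched when their reference points lie within a hypothetical `radius` of each other in the ego frame, and (b) each resulting group is fused by one step of unlearned dot-product self-attention followed by mean pooling. The function names, the radius, and the feature size are all illustrative assumptions.

```python
import numpy as np

def spatial_query_matching(ref_points, radius):
    """Assumed SQM stand-in: label queries by connected components of the
    graph that links reference points closer than `radius` (union-find)."""
    n = len(ref_points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise Euclidean distances between all reference points.
    dists = np.linalg.norm(ref_points[:, None, :] - ref_points[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= radius:
                parent[find(i)] = find(j)

    # Relabel roots to 0..k-1 in order of first appearance.
    labels = {}
    return np.array([labels.setdefault(find(i), len(labels)) for i in range(n)])

def aggregate_queries(queries, labels):
    """Assumed OQA stand-in: per group, one step of dot-product
    self-attention (no learned projections), then mean pooling."""
    d = queries.shape[1]
    fused = []
    for g in range(labels.max() + 1):
        q = queries[labels == g]                      # (m, d) queries in this graph
        scores = q @ q.T / np.sqrt(d)                 # (m, m) attention logits
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)             # row-wise softmax
        fused.append((w @ q).mean(axis=0))            # one fused query per object
    return np.stack(fused)

# Toy example: four queries from two agents, already in the ego frame.
rng = np.random.default_rng(0)
ref_points = np.array([
    [0.0, 0.0, 0.0],    # agent 1, object A
    [0.3, 0.1, 0.0],    # agent 2, object A (same object, slightly offset)
    [10.0, 5.0, 0.0],   # agent 1, object B
    [25.0, -4.0, 0.0],  # agent 2, object C
])
queries = rng.normal(size=(4, 8))  # hypothetical 8-dimensional query features
labels = spatial_query_matching(ref_points, radius=1.0)
fused = aggregate_queries(queries, labels)
print(labels, fused.shape)
```

In this toy setup the first two queries fall into one graph and the other two stay as singletons, so four transmitted queries collapse to three fused ones. A query that no other agent matches passes through unchanged. A trained model would replace the fixed radius and the unlearned attention with learned components.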