From Features to Reference Points: Lightweight and Adaptive Fusion for Cooperative Autonomous Driving

Yongqi Zhu, Morui Zhu, Qi Chen, Deyuan Qu, Song Fu, Qing Yang

TL;DR

This work tackles bandwidth-efficient cooperative perception for autonomous driving by exchanging compact reference points rather than dense feature maps. The proposed RefPtsFusion framework uses interpretable geometric anchors (positions, velocities, sizes) and introduces Selective Top-K query fusion to augment the shared cues under varying network conditions, enabling robust cross-vehicle collaboration across heterogeneous backbones. On the M$^3$CAD dataset, RefPtsFusion achieves perception performance comparable to feature- and query-based fusion while reducing communication by over five orders of magnitude, with velocity and size cues further improving temporal and spatial consistency. The approach offers a scalable, real-time solution for cooperative driving with strong robustness and predictable communication behavior, paving the way for practical deployment in diverse vehicle fleets.

Abstract

We present RefPtsFusion, a lightweight and interpretable framework for cooperative autonomous driving. Instead of sharing large feature maps or query embeddings, vehicles exchange compact reference points, e.g., objects' positions, velocities, and sizes. This approach shifts the focus from "what is seen" to "where to see", creating a sensor- and model-independent interface that works well across vehicles with heterogeneous perception models while greatly reducing communication bandwidth. To enrich the shared information, we further develop a Selective Top-K query fusion that adds high-confidence queries from the sender, achieving a strong balance between accuracy and communication cost. Experiments on the M$^3$CAD dataset show that RefPtsFusion maintains stable perception performance while reducing communication overhead by five orders of magnitude compared to traditional feature-level fusion methods, from hundreds of MB/s to only a few KB/s at 5 FPS (frames per second). Extensive experiments also demonstrate RefPtsFusion's strong robustness and consistent transmission behavior, highlighting its potential for scalable, real-time cooperative driving systems.
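
To give a feel for why a reference-point interface lands in the KB/s range, the following is a minimal sketch, not the authors' implementation. The `ReferencePoint` fields, the 32-byte wire format, the confidence-based `select_top_k` helper, and the count of 30 points per frame are all hypothetical choices made for illustration; the abstract only states that positions, velocities, and sizes are shared and that high-confidence queries are selectively added.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical wire format: 8 little-endian float32 values per point
# (x, y, z, vx, vy, width, length, height) = 32 bytes. Not from the paper.
WIRE_FORMAT = "<8f"


@dataclass
class ReferencePoint:
    x: float
    y: float
    z: float
    vx: float
    vy: float
    width: float
    length: float
    height: float
    score: float  # sender-side confidence; used for selection only, not packed


def select_top_k(points: List[ReferencePoint], k: int) -> List[ReferencePoint]:
    """Keep the k highest-confidence points (illustrative selection rule)."""
    return sorted(points, key=lambda p: p.score, reverse=True)[:k]


def pack_points(points: List[ReferencePoint]) -> bytes:
    """Serialize the geometric fields of each point into a flat byte string."""
    return b"".join(
        struct.pack(WIRE_FORMAT, p.x, p.y, p.z, p.vx, p.vy,
                    p.width, p.length, p.height)
        for p in points
    )


def bandwidth_bytes_per_second(n_points: int, fps: float = 5.0) -> float:
    """Payload rate for n_points per frame at the given frame rate."""
    return n_points * struct.calcsize(WIRE_FORMAT) * fps
```

Under these assumptions, 30 points per frame at 5 FPS come to 30 × 32 × 5 = 4800 bytes/s, which is consistent in scale with the "few KB/s" figure quoted above; the actual per-frame point count and encoding used in the paper may differ.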

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Performance–Bandwidth Trade-off. Comparison of different cooperative fusion paradigms in terms of perception accuracy (AMOTA) and communication cost. Bandwidth is computed using the actual number of effective reference points per frame, reflecting real-time communication. The proposed RefPtsFusion achieves comparable perception accuracy to feature-level fusion while reducing communication bandwidth by over five orders of magnitude. Bars with /// denote methods that explicitly support heterogeneous model fusion.
  • Figure 2: Overview of the proposed RefPtsFusion framework. It enables cooperative autonomous driving among heterogeneous vehicles through interpretable geometric information. Each sender may employ a distinct perception backbone but only needs to transmit reference points (positions, velocities, and sizes) through V2V communication. The ego vehicle performs Cross-Agent Fusion, which is primarily position-based; velocity and size information can optionally be incorporated to further enhance downstream perception tasks.
  • Figure 3: Distribution of valid queries per frame.
  • Figure 4: Qualitative comparison of cooperative perception over two consecutive frames. At time $t_n$, the ego vehicle fails to detect an object (a), while it is successfully perceived by the sender (b). With reference point fusion, both RefPtsFusion (c) and RefPtsFusion + V. (d) correctly localize the object. At the next frame $t_{n+1}$, the detection from RefPtsFusion gradually fades or disappears (g), whereas RefPtsFusion + V. maintains a stable detection (h), highlighting the benefit of incorporating velocity cues for temporal consistency.
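
One intuition for the temporal-consistency benefit described in Figure 4 is that a received position plus a velocity lets the receiver extrapolate where the object should be in the next frame. The sketch below illustrates this with a constant-velocity model; that model and the `propagate` helper are assumptions made here for illustration, not a mechanism the source describes.

```python
from typing import Tuple


def propagate(position: Tuple[float, ...],
              velocity: Tuple[float, ...],
              dt: float) -> Tuple[float, ...]:
    """Extrapolate a reference point by dt seconds, assuming constant velocity."""
    return tuple(p + v * dt for p, v in zip(position, velocity))


# At 5 FPS, consecutive frames are 0.2 s apart.
next_position = propagate((10.0, 2.0), (5.0, 0.0), dt=0.2)
```

With a position-only reference point, the receiver has no such prior for frame $t_{n+1}$, which is one plausible reading of why the detection in panel (g) fades while the one in panel (h) persists.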