V2X-PC: Vehicle-to-everything Collaborative Perception via Point Cluster

Si Liu, Zihan Ding, Jiahui Fu, Hongyu Li, Siheng Chen, Shifeng Zhang, Xu Zhou

TL;DR

This paper introduces V2X-PC, a vehicle-to-everything collaborative perception framework that replaces dense BEV-based messages with sparse point clusters to preserve object features and explicitly model structure. It presents three novel components: Point Cluster Packing (PCP) to control bandwidth while preserving geometry, Point Cluster Aggregation (PCA) to efficiently merge same-object clusters across agents, and a robust, parameter-free approach to handle pose errors and latency. Through experiments on DAIR-V2X-C and V2XSet, V2X-PC achieves state-of-the-art performance with favorable bandwidth trade-offs and demonstrates strong zero-shot robustness to noise and time delays. The work highlights the practical impact of sparse, structure-preserving representations for scalable and accurate V2X collaborative perception.

Abstract

The objective of the collaborative vehicle-to-everything perception task is to enhance the individual vehicle's perception capability through message communication among neighboring traffic agents. Previous methods focus on achieving optimal performance within bandwidth limitations and typically adopt BEV maps as the basic collaborative message units. However, we demonstrate that collaboration with dense representations is plagued by object feature destruction during message packing, inefficient message aggregation for long-range collaboration, and implicit structure representation communication. To tackle these issues, we introduce a new message unit, the point cluster, designed to represent the scene sparsely with a combination of low-level structure information and high-level semantic information. The point cluster inherently preserves object information during message packing, depends only weakly on the collaboration range, and supports explicit structure modeling. Building upon this representation, we propose a novel framework, V2X-PC, for collaborative perception. This framework includes a Point Cluster Packing (PCP) module that preserves object features and manages bandwidth by adjusting the number of points in each cluster. For effective message aggregation, we propose a Point Cluster Aggregation (PCA) module to match and merge point clusters associated with the same object. To further handle the time latency and pose errors encountered in real-world scenarios, we propose parameter-free solutions that adapt to different noise levels without fine-tuning. Experiments on two widely recognized collaborative perception benchmarks show that our method outperforms previous state-of-the-art approaches that rely on BEV maps.
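
The bandwidth control described in the abstract can be illustrated with a minimal sketch. It assumes a point cluster holds raw point coordinates plus one cluster-level feature vector, and it uses farthest point sampling as a stand-in for selecting important points; the abstract does not specify the sampling rule, and the names `PointCluster`, `farthest_point_sample`, and `pack` are hypothetical.

```python
from dataclasses import dataclass
from math import dist

Point = tuple[float, float, float]


@dataclass
class PointCluster:
    points: list[Point]    # low-level structure: raw point coordinates
    feature: list[float]   # high-level semantic feature of the whole cluster


def farthest_point_sample(points: list[Point], k: int) -> list[Point]:
    """Keep k points spread over the cluster (assumed sampling rule)."""
    if k <= 0:
        return []
    if k >= len(points):
        return list(points)
    chosen = [points[0]]
    # Distance from each point to its nearest already-chosen point.
    nearest = [dist(p, chosen[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[idx])
        nearest = [min(d, dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen


def pack(cluster: PointCluster, max_points: int) -> PointCluster:
    """Reduce the message size by dropping points; the feature is sent unchanged."""
    return PointCluster(farthest_point_sample(cluster.points, max_points),
                        cluster.feature)


cluster = PointCluster(
    points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    feature=[0.3, 0.7],
)
packed = pack(cluster, max_points=2)
print(len(packed.points), packed.feature)
```

In this sketch the message shrinks only through the point budget `max_points`, while the cluster feature is left uncompressed, which mirrors the contrast the abstract draws with compressing BEV feature maps.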

Paper Structure (19 sections, 17 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: Comparison of using the BEV map and the point cluster as the basic interaction unit for collaborative perception. (Left) Channel compression and spatial selection, applied to BEV maps to reduce bandwidth usage, suffer from feature degradation and object loss, respectively; the aggregation of BEV maps introduces unnecessary zero-padding, and its computational complexity grows quadratically as the communication range expands; the implicit structure representation communication induced by voxelization can lead to inaccurate prediction of box boundaries. (Right) The bandwidth usage of point clusters can be controlled by sampling important points instead of compressing the high-level cluster feature; the cost of message aggregation based on point clusters depends only on the number of potential objects in the scene; the spatial structure of objects can be completed with the low-level point coordinates for high-precision predictions.
  • Figure 2: Overall architecture of our method. Point Cluster Encoder (PCE) extracts point cluster representations of each agent's observation. Then a Point Cluster Packing (PCP) module is applied to filter noisy point clusters and correct the point coordinates in those kept. We can control the bandwidth usage by reducing the number of points involved in point clusters. After receiving messages from other agents, we address the time latency and pose error problems, and transform all point clusters to the coordinate space of the ego agent. Finally, we use a Point Cluster Aggregation (PCA) module to complete object information and output the predictions.
  • Figure 3: Histogram of all targets and the proportions of those belonging to the SP-O, SP-E, and CP categories in the test set of DAIR-V2X-C.
  • Figure 4: Comparison with state-of-the-art methods on the test sets of DAIR-V2X-C considering the performance-bandwidth trade-off.
  • Figure 5: Comparison with state-of-the-art methods on the test sets of V2XSet with pose noises following Gaussian distribution with standard deviations from $\{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6\}$ for heading error ($^\circ$) and positional error ($m$), respectively.
  • ...and 4 more figures
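
Two steps named in the captions of Figures 2 and 5, transforming received point clusters into the ego agent's coordinate space and perturbing poses with Gaussian noise, can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: poses are 2D `(x, y, yaw)` tuples in a shared world frame, and the function names `to_ego` and `add_pose_noise` are hypothetical.

```python
import math
import random

Pose = tuple[float, float, float]   # (x in m, y in m, yaw in radians), world frame
Point2D = tuple[float, float]


def to_ego(points: list[Point2D], sender_pose: Pose, ego_pose: Pose) -> list[Point2D]:
    """Map points from the sender's local frame into the ego agent's frame."""
    sx, sy, syaw = sender_pose
    ex, ey, eyaw = ego_pose
    out = []
    for px, py in points:
        # Sender frame -> world frame (rotate by sender yaw, then translate).
        wx = sx + px * math.cos(syaw) - py * math.sin(syaw)
        wy = sy + px * math.sin(syaw) + py * math.cos(syaw)
        # World frame -> ego frame (inverse rigid transform of the ego pose).
        dx, dy = wx - ex, wy - ey
        out.append((dx * math.cos(eyaw) + dy * math.sin(eyaw),
                    -dx * math.sin(eyaw) + dy * math.cos(eyaw)))
    return out


def add_pose_noise(pose: Pose, pos_std_m: float, heading_std_deg: float,
                   rng: random.Random) -> Pose:
    """Perturb a pose with zero-mean Gaussian positional and heading errors."""
    x, y, yaw = pose
    return (x + rng.gauss(0.0, pos_std_m),
            y + rng.gauss(0.0, pos_std_m),
            yaw + math.radians(rng.gauss(0.0, heading_std_deg)))


rng = random.Random(0)
sender = (10.0, 0.0, math.pi / 2)
ego = (0.0, 0.0, 0.0)
clean = to_ego([(1.0, 0.0)], sender, ego)
noisy = to_ego([(1.0, 0.0)], add_pose_noise(sender, 0.2, 0.2, rng), ego)
print(clean, noisy)
```

Running the transform with a noisy sender pose shows how a pose error displaces the received points in the ego frame, which is the misalignment that a robustness evaluation such as the one in Figure 5 measures.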