HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles

Deyuan Qu, Qi Chen, Yongqi Zhu, Yihao Zhu, Sergei S. Avedisov, Song Fu, Qing Yang

TL;DR

HEAD tackles the bandwidth-accuracy tradeoff in cooperative perception by fusing only features from the classification and regression heads, enabling cross-modality collaboration among heterogeneous detectors. It introduces a heterogeneous fusion network with distinct backbones (e.g., PointPillars and SECOND) and a homogeneous fusion network that uses self-attention for classification and a complementary fusion module for regression. Experiments on OPV2V and V2V4Real show HEAD outperforms late fusion while drastically reducing bandwidth, achieving performance close to intermediate fusion methods with an order of magnitude less data exchanged. The approach offers practical benefits for real-time, cross-vehicle perception in diverse vehicle fleets and sensor configurations.
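To make the "order of magnitude" claim concrete, the following back-of-the-envelope sketch compares the payload of a dense intermediate feature map with that of the classification and regression head outputs. All shapes here (a 100 x 352 BEV grid, 256 intermediate channels, 2 anchors per cell, 7 box parameters per anchor) are illustrative assumptions typical of anchor-based LiDAR detectors, not figures taken from the paper, and `payload_bytes` is a hypothetical helper.

```python
def payload_bytes(channels, height, width, bytes_per_value=4):
    """Size in bytes of one dense feature map stored as 32-bit floats."""
    return channels * height * width * bytes_per_value


# Assumed shapes for illustration only (not reported in the paper).
H, W = 100, 352                 # BEV grid resolution at the detection head
INTERMEDIATE_CHANNELS = 256     # channels of the backbone feature map
ANCHORS_PER_CELL = 2            # anchors predicted at each BEV cell
BOX_PARAMS = 7                  # x, y, z, length, width, height, yaw

cls_channels = ANCHORS_PER_CELL               # one score per anchor
reg_channels = ANCHORS_PER_CELL * BOX_PARAMS  # one box per anchor

intermediate = payload_bytes(INTERMEDIATE_CHANNELS, H, W)
heads = payload_bytes(cls_channels + reg_channels, H, W)
ratio = intermediate / heads

print(f"intermediate features: {intermediate / 1e6:.2f} MB")
print(f"head features:         {heads / 1e6:.2f} MB")
print(f"reduction factor:      {ratio:.1f}x")
```

Under these assumed shapes the saving comes entirely from the channel count (256 versus 16), since both maps share the same spatial grid. Real payloads also depend on compression and numeric precision, which this sketch ignores.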

Abstract

In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical detection models. Our goal is to develop a solution that supports cooperative perception across vehicles equipped with different sensor modalities. The method aims to outperform late fusion techniques and to achieve precision similar to state-of-the-art intermediate fusion while requiring an order of magnitude less bandwidth. We propose HEAD, a method that fuses features from the classification and regression heads in 3D object detection networks. Our method is compatible with heterogeneous detection networks such as the LiDAR-based PointPillars, SECOND, and VoxelNet, and camera Bird's-Eye View (BEV) encoders. Because the features in the detection heads are naturally smaller, we design a self-attention mechanism to fuse the classification head and a complementary feature fusion layer to fuse the regression head. Our experiments on the V2V4Real and OPV2V datasets demonstrate that HEAD effectively balances communication bandwidth and perception performance.
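As a rough illustration of fusing classification-head features with self-attention, the sketch below applies plain scaled dot-product attention across agents at each BEV cell. It is a minimal sketch under stated assumptions, not the authors' layer: it uses no learned query, key, or value projections, it assumes every agent's map has already been warped into the ego frame on a common grid, and the names `attend_cell` and `fuse_classification` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_cell(vectors):
    """Scaled dot-product self-attention over the agents at one BEV cell.

    vectors: one feature vector per agent, all of equal length.
    Returns one fused vector per agent (identity Q/K/V projections).
    """
    d = len(vectors[0])
    fused = []
    for q in vectors:
        # Similarity of this agent's vector to every agent's vector.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Weighted sum of all agents' vectors.
        fused.append([sum(w * v[i] for w, v in zip(weights, vectors))
                      for i in range(d)])
    return fused


def fuse_classification(maps, ego=0):
    """Fuse per-agent classification maps and return the ego's fused map.

    maps[agent][row][col] is a feature vector (e.g. one score per anchor).
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[attend_cell([m[r][c] for m in maps])[ego]
             for c in range(cols)]
            for r in range(rows)]


# Two agents, a 1 x 2 grid, two anchor scores per cell (made-up values).
ego_map = [[[0.9, 0.1], [0.0, 0.0]]]
cav_map = [[[0.8, 0.2], [0.7, 0.3]]]
fused_map = fuse_classification([ego_map, cav_map])
```

Because the attention weights are a softmax, each fused value is a convex combination of the agents' values at that cell, so a cell the ego vehicle cannot see (all zeros here) is pulled toward the sender's scores. A trained layer would add learned projections, which this sketch omits.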

Paper Structure

This paper contains 12 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of the proposed HEAD, a bandwidth-efficient cooperative perception framework for heterogeneous systems. All connected and autonomous vehicles can achieve effective cooperative perception without changing their individual perception performance.
  • Figure 2: An overview of the heterogeneous fusion network. The blue Ego vehicle encoder corresponds to the PointPillars backbone, whereas the green $\textit{CAV}_{j}$ vehicle encoder corresponds to the SECOND backbone.
  • Figure 3: Visualizing classification and regression maps. The ego vehicle and the sender vehicle use PointPillars and SECOND, respectively, to process point clouds. Panels (a) and (d) show the classification and regression maps of the ego vehicle. Similarly, panels (b) and (e) show the classification and regression maps of the sender vehicle.
  • Figure 4: An overview of the homogeneous fusion network. Both the Ego vehicle encoder and the $\textit{CAV}_{j}$ vehicle encoder correspond to the PointPillars backbone.
  • Figure 5: Self-attention fusion and complementary feature fusion.
  • ...and 4 more figures