EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang

TL;DR

EMIFF addresses two problems in VIC3D, pose errors caused by cross-view asynchrony and limited communication bandwidth, with an intermediate-fusion framework. Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) enhance cross-view image features, a Feature Compression (FC) module reduces transmission load, and a Point-Sampling Voxel Fusion pipeline projects and fuses the features into BEV for 3D detection. The approach yields state-of-the-art results on DAIR-V2X-C, outperforming early- and late-fusion methods at comparable transmission cost, and it is reported to benefit from higher model capacity, with targeted ablations supporting the design. This work advances practical cooperative perception by balancing detection performance with communication efficiency and calibration robustness. Its techniques (MCA, CCM, and FC) offer a blueprint for robust, bandwidth-aware VIC3D systems in real-world deployments.

Abstract

In autonomous driving, cooperative perception makes use of multi-view cameras from both vehicles and infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Currently, two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection: (1) inherent pose errors when fusing multi-view images, caused by time asynchrony across cameras; and (2) information loss during transmission, resulting from limited communication bandwidth. To address these issues, we propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). To fully exploit holistic perspectives from both vehicles and infrastructure, we propose Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) modules that enhance infrastructure and vehicle features at the scale, spatial, and channel levels to correct the pose error introduced by camera asynchrony. We also introduce a Feature Compression (FC) module with channel and spatial compression blocks for transmission efficiency. Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
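
To make the first challenge concrete, the following is a minimal sketch of how a capture-time offset turns into a pixel-level misalignment for a moving object. It is not taken from the paper: the pinhole model without distortion, the function names `project` and `pixel_shift_from_delay`, and every numeric value (focal length, principal point, object position, velocity, delay) are illustrative assumptions.

```python
# Illustrative only: a distortion-free pinhole camera with made-up intrinsics.
# Names and numbers are assumptions, not values from the EMIFF paper.

def project(point_xyz, focal_px, cx, cy):
    """Project a 3D point in camera coordinates (x right, y down, z forward) to pixels."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def pixel_shift_from_delay(point_xyz, velocity_xyz, delay_s,
                           focal_px=1000.0, cx=960.0, cy=540.0):
    """Return the (du, dv) pixel displacement of an object's projection
    when the image is captured delay_s seconds later."""
    # Constant-velocity motion during the delay (an assumption).
    moved = tuple(p + v * delay_s for p, v in zip(point_xyz, velocity_xyz))
    u0, v0 = project(point_xyz, focal_px, cx, cy)
    u1, v1 = project(moved, focal_px, cx, cy)
    return (u1 - u0, v1 - v0)


# An object 20 m ahead moving sideways at 10 m/s, with a 50 ms capture offset.
shift = pixel_shift_from_delay(point_xyz=(2.0, 1.0, 20.0),
                               velocity_xyz=(10.0, 0.0, 0.0),
                               delay_s=0.05)
print(shift)
```

Under these assumed numbers the object moves 0.5 m laterally during the offset, which at 20 m depth and a 1000-pixel focal length corresponds to a horizontal shift of about 25 pixels. Labels projected with a fixed calibration would miss the object by that amount.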

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Labels (3D bounding boxes) projected from 3D space onto the vehicle (a) and infrastructure (b) image planes using calibration parameters $P_{inf/veh}$ are often misaligned with the ground truth in the 2D images (as illustrated by the misaligned green bounding boxes). The misalignment arises because the cameras' capture times $T_{inf/veh}$ differ, so a moving object captured by the vehicle camera (in green) and the infrastructure camera (in red) appears at different locations.
  • Figure 2: The general framework of EMIFF. Separate image backbones and necks extract multi-scale image features from the vehicle and infrastructure images. The FC module compresses the source infrastructure feature $f^{S}_{inf}$ and decompresses it into multi-scale features $f^{M}_{inf}$. The MCA module, consisting of MFC and MFS blocks, enhances the multi-scale features $f^{M}_{veh/inf}$ by seeking correlations between the two sides, and CCM takes the camera parameters $(R,t,K)$ as input to reweight the features $f_{veh/inf}$ according to channel relationships. Finally, Point-Sampling Voxel Fusion projects the image features $f^{\prime}_{veh/inf}$ into 3D space to generate a unified voxel feature $V_{vic}$, which is passed to the 3D neck and head in turn for detection.
  • Figure 3: Illustration of the FC module. The feature $f_{inf}^{S}$ is compressed into $f_{inf}^{T}$ by the channel and spatial compressors; $f_{inf}^{T}$ is transmitted to the vehicle and decoded into $f_{inf}^{S\prime}$ by the channel and spatial decompressors. Finally, the multi-scale infrastructure features $f_{inf}^{M}$ are recovered from $f_{inf}^{S\prime}$ with several Conv Blocks with stride 2.
  • Figure 4: Details of MFC. Every pixel-wise feature is integrated with the spatial information of surrounding pixels via DCN, and multi-scale features are scaled to the same size through UpConv blocks.
  • Figure 5: Schema of the MCA module. In the lower branch, the vehicle feature $f_{veh}$ is generated from $f^{M}_{veh}$ through an MFC Block and Mean. In the upper branch, $f^{M}_{inf}$ is refined into keys through an MFC Block and MeanPooling, and queries are generated from $f_{veh}$ through MeanPooling. The cross-attention output weights $\omega_{inf}^{m}$ are applied to $\hat{f}^{M}_{inf}$ by inner product to form the infrastructure feature $f_{inf}$.
  • ...and 3 more figures
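
The scale-weighting step described in the Figure 5 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the MFC block is omitted, features are plain nested lists of shape channels × positions that are assumed to be already resized to a common size, and the scaled dot-product score, the softmax over scales, and the names `mean_pool` and `scale_attention` are all assumptions made for the example.

```python
import math


def mean_pool(feature):
    """Average a C x N feature (list of channel rows) over its N spatial positions."""
    return [sum(row) / len(row) for row in feature]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale_attention(veh_feature, inf_features):
    """Weight same-sized infrastructure scale features by similarity to the vehicle feature.

    veh_feature: C x N list; inf_features: list of M scale features, each C x N.
    Returns (weights over the M scales, fused C x N infrastructure feature).
    """
    query = mean_pool(veh_feature)                  # query from the vehicle side
    keys = [mean_pool(f) for f in inf_features]     # one key per infrastructure scale
    dim = len(query)
    # Scaled dot-product scores (an assumed scoring function).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    positions = len(inf_features[0][0])
    # Weighted sum of the scale features, channel by channel and position by position.
    fused = [
        [sum(w * f[c][n] for w, f in zip(weights, inf_features)) for n in range(positions)]
        for c in range(dim)
    ]
    return weights, fused


# Toy input: 2 channels, 2 positions, 2 infrastructure scales.
veh = [[1.0, 1.0], [0.0, 0.0]]
inf_scales = [
    [[1.0, 1.0], [0.0, 0.0]],   # resembles the vehicle feature
    [[0.0, 0.0], [1.0, 1.0]],   # does not
]
weights, fused = scale_attention(veh, inf_scales)
print(weights)
```

In this toy case the first scale receives the larger weight because its pooled key matches the pooled vehicle query, which is the behaviour the caption attributes to the weights $\omega_{inf}^{m}$.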