IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception

Shaohong Wang; Lu Bin; Xinyu Xiao; Zhiyu Xiang; Hangguan Shan; Eryun Liu

TL;DR

This paper addresses camera-based visual collaborative perception by enabling efficient instance-level sharing among agents. IFTR, a transformer-based framework, integrates instance feature aggregation, bandwidth-aware message selection, and cross-domain query adaptation with a deformable DETR head to produce BEV-aware detections. It demonstrates substantial AP@70 improvements across DAIR-V2X, OPV2V, and V2XSet and shows robust performance under localization noise with reduced communication costs. The work offers practical, scalable improvements for budget-constrained multi-agent perception and provides code at the project URL: https://github.com/wangsh0111/IFTR.

Abstract

Multi-agent collaborative perception has emerged as a widely recognized technology in autonomous driving in recent years. However, current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images. This severely impedes the development of budget-constrained collaborative systems and the exploitation of the advantages offered by the camera modality. This work proposes an instance-level fusion transformer for visual collaborative perception (IFTR), which enhances the detection performance of camera-only collaborative perception systems through the communication and sharing of visual features. To capture the visual information from multiple agents, we design an instance feature aggregation module that interacts with the visual features of individual agents using predefined grid-shaped bird's-eye-view (BEV) queries, generating more comprehensive and accurate BEV features. Additionally, we devise a cross-domain query adaptation as a heuristic to fuse 2D priors, implicitly encoding the candidate positions of targets. Furthermore, IFTR optimizes communication efficiency by sending instance-level features, achieving an optimal performance-bandwidth trade-off. We evaluate the proposed IFTR on a real dataset, DAIR-V2X, and two simulated datasets, OPV2V and V2XSet, achieving performance improvements of 57.96%, 9.23%, and 12.99% in AP@70, respectively, over the previous state-of-the-art methods. Extensive experiments demonstrate the superiority of IFTR and the effectiveness of its key components. The code is available at https://github.com/wangsh0111/IFTR.
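
To make the idea of sending instance-level features under a bandwidth budget more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each agent holds a small feature map plus a per-cell confidence score, transmits only the highest-scoring cells, and lets the receiver rebuild a zero-filled map. The names `select_message`, `reconstruct`, and `budget` are hypothetical and do not come from the IFTR code base.

```python
from typing import List, Tuple

Feature = List[float]
# A message is a list of (row, col, feature vector) triples.
Message = List[Tuple[int, int, Feature]]


def select_message(
    features: List[List[Feature]],
    scores: List[List[float]],
    budget: int,
) -> Message:
    """Keep only the `budget` highest-scoring cells of a feature map.

    Assumption: `scores[r][c]` is some per-cell confidence that an object
    is present; how IFTR actually derives it is not specified here.
    """
    cells = [
        (scores[r][c], r, c)
        for r in range(len(scores))
        for c in range(len(scores[0]))
    ]
    # Highest confidence first; ties keep their row-major order.
    cells.sort(key=lambda t: t[0], reverse=True)
    return [(r, c, features[r][c]) for _, r, c in cells[:budget]]


def reconstruct(
    message: Message, height: int, width: int, channels: int
) -> List[List[Feature]]:
    """Rebuild a dense map on the receiver, zero-filling unsent cells."""
    grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for r, c, feat in message:
        grid[r][c] = list(feat)
    return grid


if __name__ == "__main__":
    feats = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    conf = [[0.1, 0.9], [0.8, 0.2]]
    msg = select_message(feats, conf, budget=2)
    print(msg)
    print(reconstruct(msg, height=2, width=2, channels=2))
```

In this toy setting the communication cost scales with `budget` rather than with the full map size, which is the trade-off the abstract refers to; the real selection criterion and reconstruction step may differ.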

Paper Structure (25 sections, 1 equation, 5 figures, 3 tables)

Introduction
Related Work
Camera-based 3D Object Detection
Collaborative Perception
IFTR
Image Encoder
Message Selection and Feature Map Reconstruction
Instance Feature Aggregation
Cross-Domain Query Adaptation
Object Decoder
Experiments
Experimental Setup
Datasets.
Implementation details.
Evaluation metrics.
...and 10 more sections

Figures (5)

Figure 1: Overall architecture of IFTR. i) The message selection and feature map reconstruction module shares instance-level features, reducing bandwidth consumption; ii) in instance feature aggregation (IFA), each BEV query interacts only with image features from regions of interest across multiple views; iii) in cross-domain query adaptation (CDQA), the feature map information and 3D positional information of each instance are encoded into a 3D object query.
Figure 2: (a) The architecture of the proposed Instance Feature Aggregation (IFA); (b) Multi-View Feature Aggregation (MVFA), described in the Instance Feature Aggregation section.
Figure 3: IFTR steadily improves 3D detection performance as the number of agents grows. (a) The relationship between 3D detection performance and the maximum collaboration count on the OPV2V test set; (b) The relationship between 3D detection performance and collaboration count in a certain scene on the OPV2V test set.
Figure 4: Robustness to localization noise on the DAIR-V2X and V2XSet datasets. Zero-mean Gaussian noise with varying variance is introduced. IFTR consistently outperforms previous state-of-the-art methods.
Figure 5: Visualization of predictions from (a) No Collaboration, (b) Late Fusion, and (c) IFTR on the OPV2V test set. Green and red 3D bounding boxes represent the ground truth and the predictions, respectively.
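
The localization-noise setting in the Figure 4 caption can be illustrated with a short sketch. It assumes an agent pose of the form (x, y, yaw) and adds independent zero-mean Gaussian noise to each component. The function name `perturb_pose`, the pose layout, and the units are assumptions for illustration only; the noise levels used in the paper are not given in this summary.

```python
import random
from typing import Tuple

# Hypothetical pose layout: (x in metres, y in metres, yaw in degrees).
Pose = Tuple[float, float, float]


def perturb_pose(
    pose: Pose, pos_std: float, yaw_std: float, rng: random.Random
) -> Pose:
    """Add zero-mean Gaussian noise to a pose.

    `pos_std` and `yaw_std` are standard deviations (square roots of the
    variances); a value of 0.0 leaves that component unchanged.
    """
    x, y, yaw = pose
    return (
        x + rng.gauss(0.0, pos_std),
        y + rng.gauss(0.0, pos_std),
        yaw + rng.gauss(0.0, yaw_std),
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = (10.0, -4.0, 90.0)
    for std in (0.0, 0.2, 0.4):
        print(std, perturb_pose(clean, pos_std=std, yaw_std=std, rng=rng))
```

Sweeping the standard deviation and re-running detection on the perturbed poses would give a robustness curve of the kind the caption describes.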
