
ActFormer: Scalable Collaborative Perception via Active Queries

Suozhi Huang, Juexiao Zhang, Yiming Li, Chen Feng

TL;DR

ActFormer is a Transformer that learns bird’s eye view (BEV) representations by letting predefined BEV queries interact with multi-robot, multi-camera inputs. Guided by a learned spatial prior, a single robot can discern which collaborators, and which of their cameras, are relevant.

Abstract

Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view. Many prior works receive messages broadcast from all collaborators, leading to a scalability challenge when dealing with a large number of robots and sensors. In this work, we aim to address scalable camera-based collaborative perception with a Transformer-based architecture. Our key idea is to enable a single robot to intelligently discern the relevance of the collaborators and their associated cameras according to a learned spatial prior. This proactive understanding of the visual features' relevance does not require the transmission of the features themselves, enhancing both communication and computation efficiency. Specifically, we present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs. Each BEV query can actively select relevant cameras for information aggregation based on pose information, instead of interacting with all cameras indiscriminately. Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries, showcasing the effectiveness of ActFormer in multi-agent collaborative 3D object detection.
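
The idea that a BEV query can pick relevant cameras from pose information alone, before any image features are exchanged, can be illustrated with a purely geometric stand-in. The sketch below is not ActFormer's learned spatial prior: it is a hand-written field-of-view test, and the function name `visible_cameras`, the `(x, y, yaw, fov)` camera tuple, and the `max_range` cutoff are all assumptions made for illustration.

```python
import math

def visible_cameras(query_xy, cameras, max_range=50.0):
    """Return indices of cameras whose horizontal field of view contains a BEV point.

    query_xy : (x, y) position of a BEV query in a shared ground-plane frame.
    cameras  : list of (x, y, yaw, fov) tuples; yaw and fov are in radians.
               This pose layout is hypothetical, not taken from the paper.
    """
    selected = []
    for idx, (cx, cy, yaw, fov) in enumerate(cameras):
        dx, dy = query_xy[0] - cx, query_xy[1] - cy
        dist = math.hypot(dx, dy)
        if dist == 0.0 or dist > max_range:
            continue  # query sits on the camera or is too far away
        # Angle between the camera's optical axis and the ray to the query,
        # wrapped into [-pi, pi).
        bearing = math.atan2(dy, dx) - yaw
        bearing = (bearing + math.pi) % (2 * math.pi) - math.pi
        if abs(bearing) <= fov / 2:
            selected.append(idx)
    return selected

# Three illustrative cameras: two on the ego robot, one on a partner 20 m ahead.
cameras = [
    (0.0, 0.0, 0.0, math.radians(90)),       # ego, front-facing
    (0.0, 0.0, math.pi, math.radians(90)),   # ego, rear-facing
    (20.0, 0.0, math.pi, math.radians(90)),  # partner, facing back toward the ego
]
print(visible_cameras((10.0, 1.0), cameras))  # prints [0, 2]
```

Only poses enter the test, so the decision about which cameras to attend to needs no feature transmission. In ActFormer the relevance is learned rather than fixed by a hard geometric rule like this one.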

Paper Structure (13 sections, 7 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Baseline vs. ActFormer. The baseline densely attends BEV queries to every 2D image feature, whereas ActFormer actively selects queries based on spatial information and achieves more scalable collaborative perception.
  • Figure 2: Method overview. After partners broadcast their pose information to the ego car, our approach leverages selective deformable attention to obtain active sparse queries for images. Selective deformable attention consists of two crucial components: (a) Pose-guided Selective Attention, which efficiently focuses on multi-agent image features using active queries, enabling us to emphasize regions of interest; and (b) Active Selection Network, which concatenates pose embeddings with BEV queries and produces an interest score map. Subsequently, this interest score map is multiplied by the BEV query using a gated network to obtain the active query. This process aims to enhance collaboration efficiency and generate active sparse queries.
  • Figure 3: (A) Visualizations of the interest score maps of the ego vehicle and its 3 partners for 2 layers of PSA. (B) Comparison of the percentage of queries used versus the performance gain under AP@IoU evaluation. Act stands for ActFormer and coBEV stands for the baseline Co-BEVFormer.
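
The Active Selection Network described in the Figure 2 caption (concatenate a pose embedding with each BEV query, produce an interest score, then gate the query by that score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a single linear layer with a sigmoid stands in for the scoring network, and the top-k step controlled by `keep_ratio` is an assumed way of turning scores into a sparse query set. All names (`active_queries`, `weights`, `bias`, `keep_ratio`) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def active_queries(bev_queries, pose_embedding, weights, bias, keep_ratio=0.5):
    """Score, gate, and sparsify BEV queries.

    bev_queries    : list of query vectors (lists of floats), one per BEV cell.
    pose_embedding : list of floats encoding a collaborator's pose.
    weights, bias  : parameters of an assumed linear scoring layer whose input
                     is the concatenation [query, pose_embedding].
    """
    scores = []
    for q in bev_queries:
        features = q + pose_embedding  # list concatenation
        logit = sum(w * f for w, f in zip(weights, features)) + bias
        scores.append(sigmoid(logit))  # interest score in (0, 1)

    # Gating: scale each query by its interest score.
    gated = [[s * v for v in q] for q, s in zip(bev_queries, scores)]

    # Assumed sparsification: keep the highest-scoring fraction of queries.
    k = max(1, math.ceil(keep_ratio * len(bev_queries)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return scores, gated, keep

bev_queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
pose_embedding = [0.5]
weights = [2.0, 1.0, 1.0]  # two query dimensions + one pose dimension
scores, gated, keep = active_queries(bev_queries, pose_embedding, weights, bias=0.0)
print(keep)  # prints [0, 1]
```

With `keep_ratio=0.5`, half of the queries go on to attend to image features, which mirrors the roughly 50% query reduction reported in the abstract. The actual selection rule and network sizes are whatever the paper specifies.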