Quantum Inverse Contextual Vision Transformers (Q-ICVT): A New Frontier in 3D Object Detection for AVs

Sanjay Bhargav Dharavath, Tanmoy Dam, Supriyo Chakraborty, Prithwiraj Roy, Aniruddha Maiti

TL;DR

The paper addresses the challenge of robust multi-modal fusion for 3D object detection in autonomous driving by proposing Q-ICVT, a two-stage transformer-based fusion framework. It introduces Global Adiabatic Transformer (GAT), a reversible transformer that aligns sparse LiDAR features $G_V$ with dense image features $G_I$ in a global context, and Sparse Expert of Local Fusion (SELF), a Mixture of Experts-based local fusion mechanism using gating networks to fuse RoI LiDAR features with dense image features. Ablation studies demonstrate that both GAT and SELF are essential for peak performance, and experiments on the Waymo Open Dataset show state-of-the-art mAPH improvements, including notable gains in L2 difficulty across vehicles, pedestrians, and cyclists. Overall, Q-ICVT advances AV perception by effectively integrating global and local cross-modal information, with practical impact on detecting distant objects and improving fusion reliability.
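The summary does not spell out how GAT achieves reversibility. As a rough illustration only, the sketch below uses the standard additive coupling found in reversible residual networks, which is an assumption and not necessarily the authors' formulation. The functions `f_block` and `g_block` are hypothetical stand-ins for the attention and feed-forward sub-layers.

```python
import math

def f_block(v):
    # Hypothetical stand-in for an attention sub-layer; any deterministic function works.
    return [math.tanh(x) for x in v]

def g_block(v):
    # Hypothetical stand-in for a feed-forward sub-layer.
    return [0.5 * x * x for x in v]

def reversible_forward(x1, x2):
    # Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = [a + b for a, b in zip(x1, f_block(x2))]
    y2 = [a + b for a, b in zip(x2, g_block(y1))]
    return y1, y2

def reversible_inverse(y1, y2):
    # Undo the coupling in reverse order: x2 = y2 - G(y1), x1 = y1 - F(x2).
    x2 = [a - b for a, b in zip(y2, g_block(y1))]
    x1 = [a - b for a, b in zip(y1, f_block(x2))]
    return x1, x2
```

Because the inputs can be recomputed from the outputs, a block of this kind does not need to store intermediate activations for backpropagation, which is the usual motivation for reversible designs.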

Abstract

The field of autonomous vehicles (AVs) predominantly leverages multi-modal integration of LiDAR and camera data to achieve better performance than a single modality allows. However, the fusion process struggles to detect distant objects because of the disparity between the high resolution of cameras and the sparse data from LiDAR. Insufficient integration of global perspectives with local-level details results in sub-optimal fusion performance. To address this issue, we have developed a two-stage fusion process called Quantum Inverse Contextual Vision Transformers (Q-ICVT). This approach draws on the quantum concept of adiabatic computing to create a novel reversible vision transformer known as the Global Adiabatic Transformer (GAT). GAT aggregates sparse LiDAR features with semantic features in dense images for cross-modal integration in a global form. Additionally, the Sparse Expert of Local Fusion (SELF) module maps the sparse LiDAR 3D proposals and encodes position information of the raw point cloud onto the dense camera feature space using a gating point fusion approach. Our experiments show that Q-ICVT achieves an mAPH of 82.54 at L2 difficulty on the Waymo dataset, improving by 1.88% over current state-of-the-art fusion methods. We also analyze GAT and SELF in ablation studies to highlight the impact of Q-ICVT. Our code is available at https://github.com/sanjay-810/Qicvt
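The gating idea behind SELF can be illustrated with a generic sparse mixture-of-experts step. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `gated_fusion`, the linear gate `gate_weights`, the `experts` callables, and the `top_k` routing are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def gated_fusion(g_l, g_i, gate_weights, experts, top_k=1):
    # g_l: RoI LiDAR feature vector, g_i: image feature vector (assumed same length).
    # gate_weights: one weight vector per expert, applied to the concatenation [g_l, g_i].
    x = g_l + g_i  # list concatenation
    probs = softmax([dot(w, x) for w in gate_weights])
    # Sparse routing: keep only the top_k experts and renormalise their weights.
    top = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)[:top_k]
    norm = sum(probs[k] for k in top)
    fused = [0.0] * len(g_l)
    for k in top:
        out = experts[k](g_l, g_i)  # each expert returns a vector of len(g_l)
        fused = [f + (probs[k] / norm) * o for f, o in zip(fused, out)]
    return fused, probs
```

With `top_k` equal to the number of experts this reduces to a dense softmax-weighted mixture; smaller values route each RoI through only a few experts.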

Paper Structure (9 sections, 8 equations, 1 figure, 3 tables)

This paper contains 9 sections, 8 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Q-ICVT Pipeline: Two novel fusion blocks operate on the extracted sparse LiDAR features ($G_V$) and dense image features ($G_I$). GAT, designed around the adiabatic computing concept, matches the two modalities through global pointwise attention. In SELF, the voxelized local RoI proposal feature ($G_L$) is combined with $G_I$ through a gating mechanism for local-level fusion.
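
The caption does not define "global pointwise attention" precisely. One plausible reading, assumed here purely for illustration, is scaled dot-product cross-attention in which each LiDAR feature in $G_V$ queries all image features in $G_I$. The function `cross_attention` is hypothetical, and learned query, key, and value projections are omitted for brevity.

```python
import math

def cross_attention(g_v, g_i):
    # g_v: list of LiDAR feature vectors (queries).
    # g_i: list of image feature vectors (used as both keys and values).
    d = len(g_v[0])
    fused = []
    for q in g_v:
        # Scaled dot-product score between this LiDAR feature and every image feature.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in g_i]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of image features: a global context vector for this LiDAR feature.
        fused.append([sum(w * v[j] for w, v in zip(weights, g_i)) for j in range(d)])
    return fused
```

Each output row is a convex combination of the image features, so every sparse LiDAR feature receives context from the entire image rather than from a local neighbourhood.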