Quantum Inverse Contextual Vision Transformers (Q-ICVT): A New Frontier in 3D Object Detection for AVs
Sanjay Bhargav Dharavath, Tanmoy Dam, Supriyo Chakraborty, Prithwiraj Roy, Aniruddha Maiti
TL;DR
The paper addresses the challenge of robust multi-modal fusion for 3D object detection in autonomous driving by proposing Q-ICVT, a two-stage transformer-based fusion framework. It introduces Global Adiabatic Transformer (GAT), a reversible transformer that aligns sparse LiDAR features $G_V$ with dense image features $G_I$ in a global context, and Sparse Expert of Local Fusion (SELF), a Mixture of Experts-based local fusion mechanism using gating networks to fuse RoI LiDAR features with dense image features. Ablation studies demonstrate that both GAT and SELF are essential for peak performance, and experiments on the Waymo Open Dataset show state-of-the-art mAPH improvements, including notable gains in L2 difficulty across vehicles, pedestrians, and cyclists. Overall, Q-ICVT advances AV perception by effectively integrating global and local cross-modal information, with practical impact on detecting distant objects and improving fusion reliability.
Abstract
The field of autonomous vehicles (AVs) predominantly leverages multi-modal integration of LiDAR and camera data to achieve better performance than a single modality allows. However, the fusion process struggles to detect distant objects because of the disparity between high-resolution camera images and sparse LiDAR data. Insufficient integration of global perspectives with local-level details results in sub-optimal fusion performance. To address this issue, we have developed a two-stage fusion process called Quantum Inverse Contextual Vision Transformers (Q-ICVT). This approach draws on the concept of adiabatic computing from quantum computing to create a novel reversible vision transformer known as the Global Adiabatic Transformer (GAT). GAT aggregates sparse LiDAR features with semantic features from dense images for global cross-modal integration. Additionally, the Sparse Expert of Local Fusion (SELF) module maps the sparse LiDAR 3D proposals and encodes position information of the raw point cloud onto the dense camera feature space using a gating point fusion approach. Our experiments show that Q-ICVT achieves an mAPH of 82.54 for L2 difficulties on the Waymo dataset, improving by 1.88% over current state-of-the-art fusion methods. We also analyze GAT and SELF in ablation studies to highlight the impact of Q-ICVT. Our code is available at https://github.com/sanjay-810/Qicvt
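
To make the term "reversible" concrete, the following is a minimal sketch of the additive coupling used in generic reversible residual networks, where the inputs of a block can be recovered exactly from its outputs. It is an illustration under stated assumptions and not the GAT implementation from the paper: the two-stream split, the helper names (`make_mixer`, `reversible_forward`, `reversible_inverse`), and the toy tanh sub-functions are all hypothetical stand-ins for the attention and feed-forward sub-layers a real reversible transformer would use.

```python
import math
import random


def make_mixer(dim, seed):
    """Build a hypothetical sub-function: fixed random linear map followed by tanh.

    In a real reversible transformer this slot would hold an attention or
    feed-forward sub-layer; the toy map here only serves the illustration.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def mixer(vec):
        return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

    return mixer


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def reversible_forward(x1, x2, f, g):
    """Additive coupling: y1 = x1 + f(x2), then y2 = x2 + g(y1)."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Undo the coupling in reverse order, without inverting f or g themselves."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


DIM = 4
f = make_mixer(DIM, seed=0)
g = make_mixer(DIM, seed=1)

# Two toy feature streams (assumed here to be halves of one feature vector).
x1 = [0.5, -1.0, 2.0, 0.0]
x2 = [1.5, 0.25, -0.75, 3.0]

y1, y2 = reversible_forward(x1, x2, f, g)
r1, r2 = reversible_inverse(y1, y2, f, g)

# Reconstruction error should be at the level of floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print("inputs recovered:", max_error < 1e-9)
```

Because the inverse only needs to evaluate `f` and `g` forward, intermediate activations can be recomputed during backpropagation instead of stored, which is the usual motivation for reversible designs.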
