ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection
Janghyun Baek, Mincheol Chang, Seokha Moon, Seung Joon Lee, Jinkyu Kim
TL;DR
ALIGN tackles occlusion and crowded-scene challenges in multi-modal 3D object detection by replacing heuristic query initialization with a structured, object-aware approach. The framework introduces Occlusion-aware Center Estimation (OCE), Adaptive Neighbor Sampling (ANS), and Dynamic Query Balancing (DQB), which produce object-centric, semantically grounded, and background anchors, respectively; these anchors are integrated into existing transformer-based detectors. On nuScenes, ALIGN yields consistent gains of up to +0.9 mAP and +1.2 NDS, most notably for models with uniform initialization and under heavy occlusion. The method is modular and model-agnostic, adds manageable computational overhead, and offers a practical route to more robust 3D perception in real-world driving scenarios.
Abstract
Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
