ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection

Janghyun Baek, Mincheol Chang, Seokha Moon, Seung Joon Lee, Jinkyu Kim

TL;DR

ALIGN tackles occlusion and crowded-scene challenges in multi-modal 3D object detection by replacing heuristic query initialization with a structured, object-aware approach. The framework introduces Occlusion-aware Center Estimation (OCE), Adaptive Neighbor Sampling (ANS), and Dynamic Query Balancing (DQB) to generate object-centric, semantically grounded, and background anchors, respectively, which are integrated into existing transformer-based detectors. On nuScenes, ALIGN yields consistent mAP and NDS gains of up to +0.9 mAP and +1.2 NDS, most notably for models with uniform initialization and under heavy occlusion. The method is modular and model-agnostic, adds manageable computational overhead, and offers a practical route to more robust 3D perception in real-world driving scenarios.

Abstract

Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
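To make the contrast between the initialization schemes concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It uses plain 2D BEV coordinates; all function names, the neighbor radius, the number of neighbors per center, and the 25% background share are assumptions chosen for illustration. Object centers are taken as given here, whereas ALIGN estimates them with OCE from LiDAR geometry and image semantics.

```python
import random


def random_init(num_queries, bev_range, seed=0):
    """Baseline: query positions drawn uniformly over the BEV extent."""
    rng = random.Random(seed)
    lo, hi = bev_range
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(num_queries)]


def heatmap_init(heatmap, num_queries):
    """Baseline: (row, col) indices of the top-k cells of a BEV heatmap."""
    cells = [(score, r, c)
             for r, row in enumerate(heatmap)
             for c, score in enumerate(row)]
    cells.sort(key=lambda t: -t[0])  # highest score first; ties keep raster order
    return [(r, c) for _, r, c in cells[:num_queries]]


def object_aware_init(centers, num_queries, bev_range,
                      neighbors_per_center=2, radius=1.0,
                      min_background=0.25, seed=0):
    """Toy object-aware scheme (hypothetical parameters throughout).

    Places one query at each given center, adds a few jittered neighbors
    around it, and reserves at least `min_background` of the query budget
    for uniformly sampled background queries.
    """
    rng = random.Random(seed)
    # Centers first, then neighbors, so truncation drops neighbors before centers.
    fg = list(centers)
    for cx, cy in centers:
        for _ in range(neighbors_per_center):
            fg.append((cx + rng.uniform(-radius, radius),
                       cy + rng.uniform(-radius, radius)))
    # Foreground may use at most the budget left after the background share.
    max_fg = num_queries - int(num_queries * min_background)
    fg = fg[:max_fg]
    # Whatever budget remains goes to background queries.
    bg = random_init(num_queries - len(fg), bev_range, seed=seed + 1)
    return fg, bg


if __name__ == "__main__":
    centers = [(10.0, 5.0), (-3.0, 7.0)]  # stand-ins for estimated object centers
    fg, bg = object_aware_init(centers, num_queries=12, bev_range=(-50.0, 50.0))
    print(len(fg), "foreground queries,", len(bg), "background queries")
```

In this sketch a sparse scene leaves most of the budget to background queries, while a crowded scene fills the foreground share and still keeps the reserved background portion, which mirrors the balancing idea described for DQB only at a very coarse level.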

Paper Structure

This paper contains 29 sections, 10 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Query Initialization Comparison. Existing 3D object detectors typically adopt one of the following query initialization schemes: (a) random initialization, where queries are sampled from uniformly distributed spatial regions, or (b) heatmap-based sampling from salient regions identified in BEV heatmaps. In contrast, (c) ALIGN accurately estimates object centers and samples queries in the vicinity of each center, while maintaining balanced background queries.
  • Figure 2: An overview of our proposed method, ALIGN, which enhances query-based 3D object detectors using a novel query initialization strategy. Our model consists of three main components: (i) Occlusion-aware Center Estimation (OCE) for accurate object center estimation from LiDAR point cloud and image segmentation, (ii) Adaptive Neighbor Sampling (ANS) for generating object candidates via LiDAR clustering and sampling object-aware queries around each object, and (iii) Dynamic Query Balancing (DQB) for augmenting background queries, ensuring a balanced foreground–background distribution.
  • Figure 3: Detailed illustration of each module. (a) OCE predicts object centers by integrating LiDAR geometry and image semantics, improving localization under occlusions. (b) ANS samples segmentation-aligned neighbors around each LiDAR cluster core to expand anchor coverage and feature diversity. (c) DQB balances object-centric and background queries by scene complexity, retaining a fixed portion of background anchors for spatial coverage. These modules jointly optimize query initialization, leading to robust 3D object detection.
  • Figure 4: Qualitative Comparison of 3D Object Detection Results with and without ALIGN. (a) BEV visualizations show that ALIGN improves localization of small and occluded objects compared to the baseline, reducing missed detections and refining box placement. (b) Our method detects heavily occluded pedestrians missed by the baseline. (c) In crowded scenes, our method better resolves overlaps, improving object separation and robustness. (d) Detection is more accurate for distant and small instances. Red dotted regions highlight areas where ALIGN outperforms the baseline.
  • Figure 5: Qualitative comparison across GT, baseline, and CMT with ALIGN. ALIGN improves detection of occluded and crowded objects (red dotted regions), localizing barriers and occluded pedestrians missed by the baseline. All models struggle with small, heavily occluded traffic cones, indicating room for further improvement.
  • ...and 1 more figure