NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving
Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng
TL;DR
This work addresses the lack of large-scale, diverse multi-view 3D grounding benchmarks in autonomous driving by introducing the NuGrounding benchmark and the Hierarchy of Grounding (HoG) method used to generate its instructions. It proposes a framework that fuses instruction understanding from multi-modal LLMs with precise 3D localization from a BEV-based detector, using a context query aggregator and a fusion decoder to produce accurate 3D bounding boxes. The approach outperforms adapted baselines, reporting $P=0.59$ and $R=0.64$ (improvements of 50.8% and 54.7%, respectively), and demonstrates strong data efficiency and generalization across four instruction levels. The work provides a scalable dataset and a cross-modal grounding paradigm with implications for open-vocabulary grounding and planning in autonomous driving.
Abstract
Multi-view 3D visual grounding is critical for autonomous vehicles to interpret natural language and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. To construct NuGrounding, we present a Hierarchy of Grounding (HoG) method that generates hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that combines the instruction comprehension abilities of multi-modal LLMs (MLLMs) with the precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder that refines spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method outperforms baselines adapted from representative 3D scene understanding methods by a significant margin, achieving 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%, respectively.
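To make the reported precision and recall figures concrete, the following minimal sketch shows how such metrics could be computed for a grounding result. The matching rule is an assumption made for illustration only: predicted and ground-truth boxes are reduced to their centers and matched greedily one-to-one within a distance threshold. The function name `precision_recall` and the parameter `max_dist` are hypothetical, and the abstract does not specify the actual matching criterion used in the paper.

```python
import math


def precision_recall(pred_centers, gt_centers, max_dist=2.0):
    """Compute precision and recall from box centers.

    Assumption (not from the paper): a prediction is a true positive if it
    lies within `max_dist` meters of a ground-truth center that has not
    already been matched. Matching is greedy, in prediction order.
    """
    unmatched_gt = list(range(len(gt_centers)))
    tp = 0
    for p in pred_centers:
        best, best_d = None, max_dist
        # Find the closest still-unmatched ground-truth center in range.
        for gi in unmatched_gt:
            d = math.dist(p, gt_centers[gi])
            if d <= best_d:
                best, best_d = gi, d
        if best is not None:
            unmatched_gt.remove(best)
            tp += 1
    fp = len(pred_centers) - tp  # predictions with no matching target
    fn = len(gt_centers) - tp    # targets that were never localized
    precision = tp / (tp + fp) if pred_centers else 0.0
    recall = tp / (tp + fn) if gt_centers else 0.0
    return precision, recall
```

For example, with three predictions of which two fall near the two referred objects, this sketch yields a precision of 2/3 and a recall of 1. Precision penalizes boxes that match no referred object, while recall penalizes referred objects that are missed.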
