NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng

TL;DR

This paper tackles the lack of large-scale, diverse multi-view 3D grounding benchmarks in autonomous driving by introducing the NuGrounding benchmark and the Hierarchy of Grounding (HoG) method for generating its prompts. It proposes a framework that fuses instruction understanding from multi-modal LLMs with precise 3D localization from a BEV-based detector, using a context query aggregator and a fusion decoder to produce 3D bounding boxes. The approach reports precision $P=0.59$ and recall $R=0.64$, improvements of 50.8% and 54.7% over baselines adapted from representative 3D scene understanding methods, and demonstrates strong data efficiency and generalization across four instruction levels. This work provides a scalable dataset and a cross-modal grounding paradigm with implications for open-vocabulary grounding and planning in autonomous driving.

Abstract

Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural language and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. To construct NuGrounding, we present a Hierarchy of Grounding (HoG) method that generates hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that combines the instruction comprehension abilities of multi-modal LLMs (MLLMs) with the precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder that refines spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method outperforms the baselines adapted from representative 3D scene understanding methods by a significant margin and achieves 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%.

Paper Structure

This paper contains 30 sections, 4 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Comparison of various MLLM-based multi-view 3D grounding frameworks. (a) Localize target objects by generating a textual response. (b) Directly decode the MLLM hidden embeddings to predict the 3D bounding box. (c) Ours: Fuse the semantic context query with the 3D spatial information for precise 3D bounding box regression.
  • Figure 2: Data construction flow of NuGrounding. First, we annotate diverse common attributes for each object. The category, movement, and relationship attributes are generated by rule-based calculations using the official annotated information. The appearance attribute is extracted from other datasets, and we manually check its correctness. Then the proposed Hierarchy of Grounding (HoG) method is used to generate textual prompts across four levels.
  • Figure 3: Statistics of the NuGrounding dataset. (a) Word cloud of the top 50 words used in NuGrounding. (b) Proportion of hierarchical subsets. NuGrounding can be divided into four levels based on the number of selected attribute combinations. The size of each arc represents the proportion of the corresponding subset, and subsets of the same level share the same color. (c) Distribution of the number of objects per prompt. The horizontal axis shows the number of objects corresponding to each prompt, and the vertical axis shows the number of prompts with that object count.
  • Figure 4: Overall architecture of the proposed NuGrounding. It consists of three parts: a BEV-based detector that provides visual embeddings, a context query aggregator designed to accommodate both visual embeddings and language instructions, and a fusion decoder that fuses the semantic context query and the 3D object query.
  • Figure 5: Visual comparison between NuGrounding (ours) and existing related works. Given a language prompt, NuGrounding detects the described objects even in challenging cases, such as objects that span multiple camera views or are occluded.
  • ...and 7 more figures
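
The captions for Figures 2 and 3 describe four object attributes (category, appearance, movement, relationship) and four prompt levels defined by how many attributes are combined. A minimal Python sketch of that idea follows. It assumes that every non-empty subset of the four attributes is a valid combination and that a level equals the subset size; the function names, the example object, and the phrase template are hypothetical illustrations, not the authors' HoG implementation.

```python
from itertools import combinations

# Attribute names taken from the Figure 2 caption.
ATTRIBUTES = ("category", "appearance", "movement", "relationship")


def attribute_subsets_by_level(attributes=ATTRIBUTES):
    """Group every non-empty attribute combination by its size.

    Assumption: level k contains all combinations of exactly k attributes.
    """
    return {
        level: list(combinations(attributes, level))
        for level in range(1, len(attributes) + 1)
    }


def render_prompt(obj, subset):
    """Build a referring phrase from the selected attributes.

    Hypothetical template: modifiers, then the noun, then the relationship.
    """
    words = [obj[name] for name in ("appearance", "movement") if name in subset]
    # Fall back to a generic noun when the category is not selected.
    words.append(obj["category"] if "category" in subset else "object")
    phrase = "the " + " ".join(words)
    if "relationship" in subset:
        phrase += " " + obj["relationship"]
    return phrase


if __name__ == "__main__":
    # Hypothetical annotated object, not taken from the dataset.
    example = {
        "category": "car",
        "appearance": "white",
        "movement": "moving",
        "relationship": "in front of the ego vehicle",
    }
    for level, subsets in attribute_subsets_by_level().items():
        for subset in subsets:
            print(level, render_prompt(example, subset))
```

Under these assumptions, the four levels contain 4, 6, 4, and 1 attribute combinations respectively, 15 in total. The actual subsets shown in Figure 3(b) may differ.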