3EED: Ground Everything Everywhere in 3D

Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu

TL;DR

3EED introduces a large-scale outdoor 3D visual grounding benchmark covering Vehicle, Drone, and Quadruped platforms to address the lack of cross-platform, multi-modal grounding in open-world environments. It combines synchronized RGB-LiDAR data with a scalable, human-validated annotation pipeline and platform-aware normalization to enable robust cross-platform learning. The authors propose a unified baseline with CPA, MSS, and SAF that significantly enhances cross-platform grounding performance and reduces domain gaps compared to indoor-focused baselines. Extensive experiments reveal notable cross-platform generalization gaps in existing methods and demonstrate the value of diverse, multi-platform supervision for robust 3D language grounding. The dataset and toolkit are released to foster future research in language-driven 3D embodied perception.

Abstract

Visual grounding in 3D is key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited by their indoor focus, single-platform scope, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes, which is 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

Paper Structure

This paper contains 44 sections, 2 equations, 17 figures, and 10 tables.

Figures (17)

  • Figure 1: Multi-modal, multi-platform 3D grounding from 3EED. Given a scene and a structured natural language expression, the task is to localize the referred object in 3D space. Our dataset captures diverse embodied robot sensing viewpoints from Vehicle, Drone, and Quadruped platforms, presenting unique challenges in spatial reasoning, scene analysis, and cross-platform 3D generalization.
  • Figure 2: Overview of annotation workflow. Left: We collect 3D boxes using multi-detector fusion, tracking, filtering, and manual verification across platforms. Middle: Referring expressions are produced by prompting a VLM with structured cues (class, status, position, relations), followed by rule-based rewriting and human refinement. Right: Platform-specific word clouds highlight distinct linguistic patterns in descriptions across vehicle, drone, and quadruped agents.
  • Figure 3: Examples of multi-platform 3D grounding from the 3EED dataset. There are clear discrepancies across both sensory data (2D & 3D) and referring expressions from the Vehicle, Drone, and Quadruped platforms.
  • Figure 4: Dataset statistics of the three platforms in 3EED. Left: Target bounding box distributions in polar coordinates. Color intensity indicates the frequency of targets in each $(\rho, \theta^r)$ bin. Middle: Scene distribution for train/val splits on each platform, along with per-scene object count histograms. Right: Elevation distributions of input point cloud, $p^z$, reflecting view-dependent elevation biases.
  • Figure 5: Examples of multi-object 3D grounding from the 3EED dataset. Given a scene and a multi-object expression, the goal of this task is to localize the 3D bounding box of each referred object by reasoning over both semantic attributes and inter-object spatial relationships.
  • ...and 12 more figures
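
The polar statistics described in the Figure 4 caption (target frequency per $(\rho, \theta^r)$ bin) can be illustrated with a minimal sketch. This is not the authors' code: the function names `polar_bin` and `polar_histogram`, the bin sizes, the angle convention (measured counter-clockwise from the sensor's x-axis), and the sample box centers are all assumptions made for illustration.

```python
import math
from collections import Counter


def polar_bin(x, y, rho_step=10.0, n_theta=8):
    """Map a box center (x, y) in the sensor frame to a (rho, theta) bin index.

    rho_step (meters per radial bin) and n_theta (number of angular bins)
    are hypothetical choices, not values taken from the paper.
    """
    rho = math.hypot(x, y)                      # horizontal range to the target
    theta = math.atan2(y, x) % (2 * math.pi)    # wrap the angle into [0, 2*pi)
    rho_idx = int(rho // rho_step)
    # Clamp guards against theta rounding up to exactly 2*pi.
    theta_idx = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
    return rho_idx, theta_idx


def polar_histogram(centers, rho_step=10.0, n_theta=8):
    """Count how many target centers fall into each polar bin."""
    return Counter(polar_bin(x, y, rho_step, n_theta) for x, y in centers)


# Made-up box centers (meters), purely for demonstration.
centers = [(5.0, 0.0), (1.0, 12.0), (-3.0, -4.0), (6.0, 1.0)]
hist = polar_histogram(centers)
for (rho_idx, theta_idx), count in sorted(hist.items()):
    print(f"rho bin {rho_idx}, theta bin {theta_idx}: {count} target(s)")
```

Computing such a histogram separately for each platform would give one way to compare where targets tend to appear relative to the sensor, which is the kind of view-dependent bias the caption refers to.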