LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
TL;DR
Problem: zero-shot, open-vocabulary 3D visual grounding with CLIP-based grounders struggles with compositional language and spatial relations. Approach: LLM-Grounder uses an LLM as an agent that decomposes queries, plans steps, and orchestrates grounding tools (Target Finder, Landmark Finder) to reason over 3D scenes without labeled data. Contributions: achieves state-of-the-art zero-shot accuracy on ScanRefer, analyzes where the LLM helps most (complex text, lower visual complexity), and demonstrates the practicality of LLM tool use for grounding in robotics. Significance: enables robust, adaptable grounding for navigation, manipulation, and question answering in open-vocabulary 3D environments.
Abstract
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website: https://chat-with-nerf.github.io/
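The decompose, ground, and reason loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Candidate`, `Plan`, `decompose`, `find`, and `ground` are hypothetical, the LLM is replaced by a simple string split on a known relation phrase, the visual grounder (OpenScene or LERF in the paper) is replaced by an exact label lookup over a hand-written scene, and the spatial reasoning step is reduced to choosing the target candidate closest to any landmark candidate.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    label: str                          # object phrase matched by the grounder
    center: Tuple[float, float, float]  # 3D box center (units assumed to be meters)
    score: float                        # grounder confidence (assumed in [0, 1])


@dataclass(frozen=True)
class Plan:
    target: str
    landmark: Optional[str]
    relation: Optional[str]


def strip_article(phrase: str) -> str:
    """Drop a leading 'the'/'a'/'an' so phrases match scene labels."""
    words = phrase.strip().split()
    if words and words[0].lower() in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def decompose(query: str) -> Plan:
    """Stand-in for the LLM: split the query on a known relation phrase."""
    for relation in ("next to", "near"):
        left, sep, right = query.partition(f" {relation} ")
        if sep:
            return Plan(strip_article(left), strip_article(right), relation)
    return Plan(strip_article(query), None, None)


def find(scene: List[Candidate], phrase: str) -> List[Candidate]:
    """Stand-in for the open-vocabulary grounder: exact label match."""
    return [c for c in scene if c.label == phrase]


def ground(scene: List[Candidate], query: str) -> Optional[Candidate]:
    """Pick a target candidate, using the landmark to disambiguate if present."""
    plan = decompose(query)
    targets = find(scene, plan.target)
    if not targets:
        return None
    landmarks = find(scene, plan.landmark) if plan.landmark else []
    if not landmarks:
        # No usable landmark: fall back to the grounder's top-scoring target.
        return max(targets, key=lambda c: c.score)
    # Proximity relation: choose the target nearest to any landmark.
    return min(
        targets,
        key=lambda t: min(dist(t.center, l.center) for l in landmarks),
    )


if __name__ == "__main__":
    scene = [
        Candidate("chair", (0.0, 0.0, 0.0), 0.9),
        Candidate("chair", (3.0, 0.0, 0.0), 0.6),
        Candidate("table", (3.5, 0.0, 0.0), 0.8),
    ]
    print(ground(scene, "the chair next to the table").center)
```

In this toy scene the grounder alone would return the higher-scoring chair at the origin, while the relational step selects the chair beside the table. This mirrors, in a much simplified form, the kind of compositional query the abstract reports as benefiting most from the LLM.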
