
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai

TL;DR

Problem: zero-shot open-vocabulary 3D visual grounding struggles with compositional language and spatial relations when using CLIP-based grounders. Approach: LLM-Grounder deploys an LLM as an agent that decomposes queries, plans steps, and orchestrates grounding tools (Target Finder, Landmark Finder) to reason about 3D scenes without labeled data. Contributions: achieves state-of-the-art zero-shot performance on ScanRefer, analyzes where the LLM provides the most benefit (complex text, lower visual complexity), and demonstrates the practicality of tool use for robotic grounding. Significance: enables robust, adaptable grounding for navigation, manipulation, and Q&A in open-vocabulary 3D environments.

Abstract

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website: https://chat-with-nerf.github.io/

Paper Structure (11 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: In the open-vocabulary 3D visual grounding task, CLIP-based models tend to treat text input as a "bag of words", ignoring the semantic structure of compositional text input, e.g., text describing complex spatial relations among objects. On the top-right is a demonstration of such behavior when using OpenScene (Peng et al., 2023), a CLIP-based 3D grounding method, as a visual grounder. When asked to ground the spatially-informed text query "a chair between the dining table and window", it incorrectly highlights the dining table and window, which are not the target but rather referential landmarks (red bounding boxes). We propose to address this problem by leveraging a large language model (LLM) to (1) deliberately generate a plan that decomposes complex visual grounding queries into sub-tasks; (2) orchestrate and interact with tools such as the target finder and landmark finder to collect information; and (3) leverage spatial and commonsense knowledge to reflect on the feedback collected from the tools.
  • Figure 2: Overview of LLM-Grounder. Given a query to ground an object, our approach, backed by an LLM agent, reasons about the user's request and generates a plan to ground the object by using tools. The agent interacts with tools such as the target finder and landmark finder to gather information such as object bounding boxes, object volumes, and distances to landmarks. This information is then returned to the agent, which conducts further spatial and commonsense reasoning to rank, filter, and select the best-matching candidate.
  • Figure 3: Qualitative example. LLM agent uses spatial reasoning to successfully disambiguate the correct object instance.
  • Figure 4: Performance delta (w/ LLM - w/o LLM) vs. query text complexity. The LLM helps more as the text query becomes more complex, but it fails to help significantly at the highest complexities.
  • Figure 5: Performance of various models vs. query text complexity. All models struggle with more complex sentences, but models with an LLM agent perform better, especially at these higher complexities.
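
The information flow described in Figure 2 (tools return bounding boxes, volumes, and distances to landmarks, which are then used to rank candidates) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Box` structure, the function names, and the toy coordinates are hypothetical, and the "smallest summed distance to the landmarks" rule is only a simple stand-in for the spatial and commonsense reasoning that the paper delegates to the LLM agent.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Box:
    """Axis-aligned 3D bounding box (hypothetical structure, not the paper's)."""
    center: Tuple[float, float, float]
    extent: Tuple[float, float, float]  # full side lengths along x, y, z

    @property
    def volume(self) -> float:
        dx, dy, dz = self.extent
        return dx * dy * dz


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between two box centers."""
    return math.dist(a.center, b.center)


def describe_candidates(candidates: List[Box], landmarks: List[Box]) -> List[Dict]:
    """Build the kind of per-candidate summary a tool could hand to the agent."""
    return [
        {
            "id": i,
            "volume": c.volume,
            "landmark_distances": [center_distance(c, lm) for lm in landmarks],
        }
        for i, c in enumerate(candidates)
    ]


def pick_candidate(candidates: List[Box], landmarks: List[Box]) -> int:
    """Stand-in for the agent's reasoning: smallest summed landmark distance."""
    summaries = describe_candidates(candidates, landmarks)
    best = min(summaries, key=lambda s: sum(s["landmark_distances"]))
    return best["id"]


# Toy scene for "a chair between the dining table and window".
chairs = [
    Box(center=(1.0, 0.0, 0.5), extent=(0.5, 0.5, 1.0)),  # near both landmarks
    Box(center=(5.0, 4.0, 0.5), extent=(0.5, 0.5, 1.0)),  # far corner of the room
]
landmarks = [
    Box(center=(0.0, 0.0, 0.4), extent=(1.5, 1.0, 0.8)),  # dining table
    Box(center=(2.0, 0.0, 1.5), extent=(1.0, 0.1, 1.0)),  # window
]
chosen = pick_candidate(chairs, landmarks)
print(f"selected chair index: {chosen}")
```

In this toy scene the first chair is selected because it lies close to both landmarks. A fixed numeric rule like this cannot capture relations such as "behind" or "facing", which is the gap the LLM agent is meant to fill.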