Transcrib3D: 3D Referring Expression Resolution through Large Language Models

Jiading Fang; Xiangshan Tan; Shengjie Lin; Igor Vasiljevic; Vitor Guizilini; Hongyuan Mei; Rares Ambrus; Gregory Shakhnarovich; Matthew R Walter

TL;DR

Transcrib3D tackles 3D referring expression grounding for embodied agents by using text as a bridge between 3D detections and LLM reasoning. The method converts 3D scene detections into an object-centric transcript, filters candidates, and employs an iterative code-generation–reasoning loop with a Python interpreter, guided by general principles and refined through self-reasoned corrections. It achieves state-of-the-art results on ReferIt3D and ScanRefer and demonstrates real-robot pick-and-place capabilities, with edge-friendly fine-tuning that narrows the gap to GPT-4. This work suggests that a text-based grounding paradigm can reduce reliance on costly multi-modal representations and paves the way for data-efficient, deployable 3D grounding in robotics.
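As a rough illustration of the first two steps named above (transcribing detections into text, then filtering candidates), the following is a minimal sketch and not the authors' implementation. The `Detection` fields, the transcript line format, and the substring-based `prefilter` are all assumptions made for this example; the paper's actual transcript schema and filtering rule may differ.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical output of a 3D detector for one object."""
    obj_id: int
    label: str
    center: tuple  # (x, y, z), assumed to be in meters
    size: tuple    # (dx, dy, dz) extent of the bounding box
    color: str


def transcribe(objects):
    """Render each detection as one line of plain text (assumed format)."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        sx, sy, sz = o.size
        lines.append(
            f"obj{o.obj_id}: {o.label}, "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}), color={o.color}"
        )
    return "\n".join(lines)


def prefilter(objects, utterance):
    """Keep objects whose label occurs in the utterance.

    Naive substring matching stands in for whatever relevance test
    the real system uses.
    """
    text = utterance.lower()
    return [o for o in objects if o.label.lower() in text]


scene = [
    Detection(0, "chair", (1.0, 2.0, 0.5), (0.5, 0.5, 1.0), "brown"),
    Detection(1, "desk", (1.5, 2.0, 0.4), (1.2, 0.6, 0.8), "white"),
    Detection(2, "lamp", (4.0, 0.5, 1.2), (0.2, 0.2, 0.5), "black"),
]
kept = prefilter(scene, "the chair between the white and yellow desks")
transcript = transcribe(kept)
print(transcript)
```

Here the lamp is dropped because the utterance never mentions it, so the text passed to the LLM stays short. A plain-text transcript like this is what lets a language model reason about the scene without any learned 3D-language embedding.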

Abstract

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging: it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, sidestepping the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a large improvement in performance over previous multi-modal baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose a self-correction approach to fine-tuning that trains smaller models, yielding performance close to that of large models. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions. The project site is at https://ripl.github.io/Transcrib3D.

Paper Structure (17 sections, 6 figures, 3 tables)

Introduction
Related Work
Grounding Large Language Models
LLM Reasoning
Methodology
Detect and Transcribe 3D Information
Pre-Filtering Relevant Objects for Utterance
Iterative Code Generation and Reasoning
Principles-Guided Zero-Shot Prompting
Fine-tuning from Self-Reasoned Correction
Experiments
Grounding Accuracy on ReferIt3D
Grounding Accuracy on ScanRefer
Effects of Fine-tuning Methods
Referring Expressions for Robot Manipulation
...and 2 more sections

Figures (6)

Figure 1: The overall Transcrib3D framework, which takes as input the colored point-cloud and referring expression (in green), and outputs the ID or bounding box of the referent object. To resolve the referring expression "the chair in the corner of the room, between the white and yellow desks", the framework needs to locate the pillow in the green box, while all other pillows in red boxes are distractors.
Figure 2: Transcrib3D enables a robot to resolve the complex 3D referring expressions necessary to follow pick-and-place instructions. In this example, the robot is given a natural language instruction that includes challenging referring expressions: "cover the toy duckie surrounded by the cups with the black cup farthest from the shortest cup".
Figure 3: Illustration of the iterative code generation and reasoning process. After code generation, the execution results from a local Python interpreter are fed back to the LLM for further reasoning. The LLM then proceeds to either 1) fix code errors when encountered, 2) generate additional code to obtain more information, or 3) output the referred objects if it has all the needed information. This process continues until the LLM believes the reasoning to be complete.
Figure 4: Qualitative comparisons between Transcrib3D (ours, in green) and 3D-VisTA (in red) on the NR3D dataset.
Figure 5: Qualitative comparison of the grounding performance of (top) CaP and (bottom) CaP+Transcrib3D on a real robot.
...and 1 more figure
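
The loop described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: `resolve`, `run_snippet`, the `ANSWER:` reply convention, and the scripted stand-in for the LLM are all hypothetical names introduced here. A real system would call an actual model and should sandbox the interpreter, since `exec` runs arbitrary code.

```python
import contextlib
import io


def run_snippet(code, env):
    """Run code in env; return its printed output, or the error message."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # unsafe outside a sandbox
    except Exception as e:
        return buf.getvalue() + f"{type(e).__name__}: {e}"
    return buf.getvalue()


def resolve(llm, transcript, utterance, max_turns=5):
    """Alternate LLM replies and interpreter feedback until an answer appears."""
    history = [f"Scene:\n{transcript}\nQuery: {utterance}"]
    env = {}  # interpreter state persists across turns
    for _ in range(max_turns):
        reply = llm(history)
        history.append(reply)
        if reply.startswith("ANSWER:"):
            # Case 3: the model has enough information and names the referent.
            return int(reply.split(":", 1)[1]), history
        # Cases 1 and 2: execute the code and feed output or error back.
        history.append("RESULT:\n" + run_snippet(reply, env))
    return None, history


def scripted_llm(replies):
    """Stand-in for a real LLM: returns canned replies and ignores history."""
    it = iter(replies)
    return lambda history: next(it)


replies = [
    "print(centres[1])",  # buggy: name was never defined
    "centers = {1: (0.0, 0.0), 2: (3.0, 4.0)}\n"
    "print(max(centers, key=lambda k: sum(v * v for v in centers[k])))",
    "ANSWER: 2",
]
answer, history = resolve(scripted_llm(replies), "obj1 ...\nobj2 ...",
                          "the object farthest from the origin")
print(answer)
```

The first canned reply fails with a `NameError`, which is returned to the model as text rather than raised, mirroring the "fix code errors" branch. The second reply computes the needed quantity, and the third terminates the loop.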
