Data-Efficient 3D Visual Grounding via Order-Aware Referring

Tung-Yu Wu; Sheng-Yu Huang; Yu-Chiang Frank Wang

TL;DR

This paper tackles data-efficient 3D visual grounding by introducing Vigor, which uses an LLM to derive a referential order from a natural language description and a sequence of Object-Referring blocks to progressively locate the target object in a 3D point cloud. The method leverages masked feature refinement and cross-attention guided by the referential order, along with a novel warm-up pre-training that synthesizes plausible anchor/target orders to stabilize learning with limited data. Empirical results on NR3D and ScanRefer show that Vigor achieves strong data-efficient grounding, surpassing state-of-the-art baselines in low-resource scenarios and demonstrating robustness to imperfect proposals. The approach offers a practical path to scalable 3D grounding without heavy manual annotation, enabling efficient deployment in real-world AR/robotics tasks.

Abstract

3D visual grounding aims to identify the target object that a natural language description refers to within a 3D point cloud scene. Previous works usually require a significant amount of data on point color and the corresponding descriptions to exploit the complicated verbo-visual relations between them. In our work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages an LLM to produce a desirable referential order from the input description for 3D visual grounding. With the proposed stacked object-referring blocks, the predicted anchor objects in this order allow one to locate the target object progressively, without supervision on the identities of anchor objects or the exact relations between anchor and target objects. In addition, we present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework. This allows us to better capture the complex verbo-visual relations and supports data-efficient learning. Experimental results on the NR3D and ScanRefer datasets demonstrate the superiority of Vigor in low-resource scenarios. In particular, Vigor surpasses current state-of-the-art frameworks by 9.3% and 7.6% in grounding accuracy under the 1% and 10% data settings on the NR3D dataset, respectively. Our code is publicly available at https://github.com/tony10101105/Vigor.
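
The anchor-to-target idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the referential order is assumed to be given already (Vigor obtains it from an LLM), the learned object-referring blocks are replaced here by a hand-written proximity heuristic, and every name in the code (`refer_step`, `ground`, the `label`/`center` scene layout) is hypothetical.

```python
import math


def refer_step(objects, label, prev_weights):
    """One toy referring step (assumption: a proximity heuristic, not a learned block).

    Objects whose class matches `label` are scored by how close they are to the
    objects attended in the previous step; all other objects get zero weight.
    """
    scores = []
    for obj in objects:
        if obj["label"] != label:
            scores.append(0.0)
        elif prev_weights is None:
            # First entry of the order: no context yet, so every match is equal.
            scores.append(1.0)
        else:
            # Sum of exp(-distance) to each object, weighted by its previous attention.
            scores.append(sum(
                w * math.exp(-math.dist(obj["center"], other["center"]))
                for w, other in zip(prev_weights, objects)
            ))
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


def ground(objects, referential_order):
    """Walk the order from the first anchor to the target and return the best index."""
    weights = None
    for label in referential_order:
        weights = refer_step(objects, label, weights)
    return max(range(len(objects)), key=lambda i: weights[i])


# Hypothetical scene for "the chair next to the table by the window".
objects = [
    {"label": "window", "center": (0.0, 0.0, 1.5)},
    {"label": "table", "center": (0.5, 0.2, 0.4)},
    {"label": "table", "center": (4.0, 3.0, 0.4)},
    {"label": "chair", "center": (0.9, 0.3, 0.5)},
    {"label": "chair", "center": (4.2, 3.5, 0.5)},
]
order = ["window", "table", "chair"]  # anchors first, target last

print(ground(objects, order))  # prints 3 (the chair near the table by the window)
```

In this sketch, each step narrows attention using only the result of the previous step, so the two chairs are disambiguated through the table, which is in turn disambiguated through the window. No anchor identity is supplied as supervision; only the order of class names is used.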

Paper Structure (49 sections, 3 equations, 8 figures, 9 tables, 2 algorithms)

Introduction
Related Work
2D Visual Grounding
3D Visual Grounding
Data-Efficient 3D Visual Grounding
Methodology
Problem Formulation and Model Overview
Problem formulation
Model overview
3D Visual Grounding with Order-Aware Object Referring
Object-referring blocks
Object feature enhancement
Order-Aware Warm-up with Synthetic Referential Order
Augmenting plausible referential order and description
Warm-up objectives
...and 34 more sections

Figures (8)

Figure 1: Referential orders for 3D grounding. The order manifests an anchor-to-target referring process that helps the grounding model identify the target object described in the input.
Figure 2: Architecture of our 3D Visual Grounding Framework with Order-Aware Referring (Vigor). Taking a point cloud scene $C$ and a natural language description $D$ as inputs, Vigor produces a referential order of anchor/target objects $O_{1:B}$ and applies Object-Referring blocks $R_{1:B}$ to locate the target object progressively.
Figure 3: Illustration of feature enhancement and synthesizing warmup data in Vigor.
Figure 4: 3D grounding examples on NR3D. Blue/green/red boxes denote ground truth/correct/incorrect predictions. Both MVT and Vigor fail on the last two cases because the target object is extremely small (e.g., a cup) and the description does not mention any anchor objects.
Figure 5: Quantitative results on NR3D. When a relatively large share of the data is available (above 30%), Vigor is comparable to MVT+CoT3DRef (Bakr et al., 2023). As the amount of data decreases, Vigor outperforms MVT+CoT3DRef.
...and 3 more figures
