AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

Xinyi Wang; Na Zhao; Zhiyuan Han; Dan Guo; Xun Yang

TL;DR

AugRefer tackles the data scarcity and contextual reasoning bottlenecks in 3D visual grounding with two components. The first is cross-modal augmentation, which inserts objects into 3D scenes, renders multi-granular views, and generates accurate captions. The second is a Language-Spatial Adaptive Decoder (LSAD), which integrates language cues with both global and pairwise spatial relations. LSAD applies cross-attention, global spatial attention, and pairwise spatial attention at each decoder layer, improving the discrimination of referents amidst distractors. Empirically, AugRefer consistently improves state-of-the-art baselines (e.g., BUTD-DETR and EDA) across ScanRefer, Nr3D, and Sr3D, achieving SOTA results on Nr3D and Sr3D and demonstrating strong gains from multi-level augmentation and spatial reasoning. The approach is modular and compatible with existing 3DVG models, offering a practical path to richer training signals and more accurate grounding in real-world 3D scenes.
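To make the idea of pairwise spatial attention concrete, here is a minimal sketch in plain Python. It is not the paper's LSAD formulation, which this summary does not specify. It assumes one simple way to inject pairwise spatial relations: subtracting a scaled Euclidean distance between object centers from the usual dot-product attention logits. The function name `pairwise_spatial_attention` and the `scale` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_spatial_attention(features, centers, scale=1.0):
    """Self-attention over object candidates with a distance-based bias.

    features: list of equal-length feature vectors, one per candidate.
    centers:  list of (x, y, z) box centers, one per candidate.
    scale:    hypothetical weight on the distance penalty (an assumption).
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        logits = []
        for j, key in enumerate(features):
            similarity = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            distance = math.dist(centers[i], centers[j])
            # Nearby candidates receive a larger logit than distant ones.
            logits.append(similarity - scale * distance)
        weights = softmax(logits)
        outputs.append(
            [sum(weights[j] * features[j][d] for j in range(len(features)))
             for d in range(dim)]
        )
    return outputs


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 5.0, 0.0)]
attended = pairwise_spatial_attention(candidates, positions)
```

In this toy setup the third candidate is far from the other two, so it contributes little to their attended features. A learned model would typically replace the fixed distance penalty with a learned function of richer pairwise geometry.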

Abstract

3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly encounter a shortage: a limited amount and diversity of text-3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG method to enrich its training data. Additionally, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.
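The object-placement step of the augmentation can be illustrated with a small sketch. The abstract does not describe the actual insertion procedure, so the following assumes a simple rejection-sampling scheme: draw a floor position at random and accept it only if the new object's axis-aligned bounding box does not intersect any existing box. The helpers `boxes_overlap` and `insert_object`, the box layout, and the floor at z = 0 are all hypothetical choices made for illustration.

```python
import random


def boxes_overlap(a, b):
    """True if two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax) intersect."""
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))


def insert_object(scene_boxes, size, floor_bounds, rng, max_tries=100):
    """Sample a collision-free floor position for a new object.

    scene_boxes:  boxes of objects already in the scene.
    size:         (dx, dy, dz) extent of the object to insert.
    floor_bounds: (xmin, ymin, xmax, ymax) of the floor area, assumed at z = 0.
    Returns the placed box, or None if no free spot was found.
    """
    dx, dy, dz = size
    xmin, ymin, xmax, ymax = floor_bounds
    for _ in range(max_tries):
        x = rng.uniform(xmin, xmax - dx)
        y = rng.uniform(ymin, ymax - dy)
        candidate = (x, y, 0.0, x + dx, y + dy, dz)
        # Reject placements that collide with any existing object.
        if not any(boxes_overlap(candidate, box) for box in scene_boxes):
            return candidate
    return None


scene = [(0.0, 0.0, 0.0, 2.0, 1.0, 0.8)]  # one existing object, e.g. a sofa
table = insert_object(
    scene,
    size=(1.0, 0.6, 0.5),
    floor_bounds=(0.0, 0.0, 4.0, 4.0),
    rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps placements reproducible, which is convenient when the inserted object must later be rendered and captioned consistently.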

Paper Structure (17 sections, 4 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Methodology
Cross-Modal Augmentation
Object Insertion.
Hybrid Rendering.
Diverse Description Generation.
Overview of 3D Visual Grounder
Language-Spatial Adaptive Decoder
Experiments
Dataset and Experimental Setting
Overall Comparison
In-depth Studies
Conclusion
Implementation Details
...and 2 more sections

Figures (9)

Figure 1: A brief illustration of our proposed AugRefer: 1) Cross-Modal Augmentation: a brown wooden table is inserted into a living room scene, and its corresponding grounding description is generated to increase data diversity. 2) 3D Visual Grounder: we leverage spatial relation-based referring to ground the target.
Figure 2: The framework overview of AugRefer. It consists of two components: 1) Cross-Modal Augmentation with three steps: ① Object Insertion $\rightarrow$ ② Hybrid Rendering $\rightarrow$ ③ Caption Generation; and 2) 3D Visual Grounder, where our designed Language-Spatial Adaptive Decoder (LSAD) aims to enable more precise grounding by incorporating 3D spatial relations.
Figure 3: a) Multi-Angle Camera: For each level of the scene, images are captured from multiple angles. b) Multi-Level Rendering: The scene is rendered at different levels.
Figure 4: Multi-Level Caption Generation. The conversation process with BLIP2 and ChatGPT for captioning images rendered at various levels. Both the Local-Level and Scene-Level captions utilize the same set of prompts. We describe the approach using the Local-Level as an example.
Figure 5: Illustrations of a) Language-Spatial Adaptive Decoder (LSAD) layer and b) Global Spatial Attention (GSA).
...and 4 more figures
