AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring
Xinyi Wang, Na Zhao, Zhiyuan Han, Dan Guo, Xun Yang
TL;DR
AugRefer tackles the data scarcity and contextual reasoning bottlenecks in 3D visual grounding by introducing cross-modal augmentation that inserts objects into 3D scenes, renders multi-granular views, and generates accurate captions, paired with a Language-Spatial Adaptive Decoder that integrates language cues with both global and pairwise spatial relations. The LSAD enhances grounding by applying cross-, global, and pairwise spatial attention at each decoder layer, improving the discrimination of referents amidst distractors. Empirically, AugRefer consistently improves state-of-the-art baselines (e.g., BUTD-DETR and EDA) across ScanRefer, Nr3D, and Sr3D, achieving SOTA results on Nr3D and Sr3D and demonstrating strong gains from multi-level augmentation and spatial reasoning. The approach is modular and compatible with existing 3DVG models, offering a practical path to richer training signals and more accurate grounding in real-world 3D scenes.
Abstract
3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly suffer from a limited amount and diversity of text-3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG method to enrich its training data. Additionally, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.
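To make the "placing objects into 3D scenes" step concrete, here is a minimal sketch of one plausible way to choose a collision-free position for an inserted object. The abstract does not specify the placement strategy, so everything here is an assumption for illustration: the `Box` type, the `place_object` helper, the use of axis-aligned bounding boxes, and rejection sampling on the floor plane are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned 3D box: center (cx, cy, cz) and full size (dx, dy, dz)."""
    cx: float
    cy: float
    cz: float
    dx: float
    dy: float
    dz: float

    def overlaps(self, other: "Box") -> bool:
        # Two axis-aligned boxes intersect only if they overlap on every axis.
        return (
            abs(self.cx - other.cx) * 2 < self.dx + other.dx
            and abs(self.cy - other.cy) * 2 < self.dy + other.dy
            and abs(self.cz - other.cz) * 2 < self.dz + other.dz
        )


def place_object(scene_boxes, size, room_min, room_max,
                 floor_z=0.0, tries=100, rng=None):
    """Hypothetical helper: sample a free floor position for a new object.

    Returns the placed Box, or None if no free spot is found in `tries` draws.
    """
    rng = rng or random.Random()
    dx, dy, dz = size
    for _ in range(tries):
        # Keep the whole footprint inside the room's x/y bounds.
        cx = rng.uniform(room_min[0] + dx / 2, room_max[0] - dx / 2)
        cy = rng.uniform(room_min[1] + dy / 2, room_max[1] - dy / 2)
        # Rest the object on the floor.
        candidate = Box(cx, cy, floor_z + dz / 2, dx, dy, dz)
        if not any(candidate.overlaps(b) for b in scene_boxes):
            return candidate
    return None
```

In a full pipeline along the lines the abstract describes, the placed object would then be rendered and captioned to form a new text-3D pair; those steps depend on foundation models and are not sketched here.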
