Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

Yongdong Luo; Haojia Lin; Xiawu Zheng; Yigeng Jiang; Fei Chao; Jie Hu; Guannan Jiang; Songan Zhang; Rongrong Ji

TL;DR

This paper introduces 3DGCTR, a DETR-like single-stage architecture that unifies 3D Visual Grounding (VG) and 3D Dense Captioning (DC). It augments a mature 3DVG backbone with a lightweight Dual-Clued Captioner and a Caption Text Prompt, which enables end-to-end multi-task training and uses prompt-based grounding to connect the two tasks. The model integrates PointMetaBase for improved visual feature extraction and reports state-of-the-art results on ScanRefer and Nr3D, with mutual gains observed when VG and DC are trained jointly. These findings highlight the practical potential of prompt-driven localization for holistic 3D scene understanding in indoor environments.

Abstract

3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Existing approaches typically adopt a two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector and therefore yields suboptimal results. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model: given a well-designed prompt as input, the 3DVG model can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network, with a Caption Text Prompt as the connection, effectively harnessing the 3DVG model's inherent localization capacity and thereby boosting 3DDC capability. This integration enables simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU under MLE training and improves upon the state-of-the-art 3DVG method by 3.16% in Acc@0.25IoU. The code is available at https://github.com/Leon1207/3DGCTR.
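The Acc@0.25IoU figure quoted above counts a grounding prediction as correct when its 3D box overlaps the ground-truth box by at least the stated IoU. A minimal sketch of that computation follows. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and a "greater than or equal" comparison; the function names `iou_3d` and `acc_at_iou` are illustrative and are not taken from the 3DGCTR code base.

```python
from typing import Sequence

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou_3d(p, g) >= threshold)
    return hits / len(gts)
```

Calling `acc_at_iou(preds, gts, 0.25)` and `acc_at_iou(preds, gts, 0.5)` on the same predictions gives the two thresholds commonly reported for ScanRefer-style grounding.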

Paper Structure (18 sections, 6 equations, 3 figures, 6 tables)

Introduction
Related Work
3D Visual Grounding
3D Dense Captioning
Method
Preliminaries: 3DVG Model
Integrating PointMetaBase into EDA
Caption Head
Caption Text Prompt
Multi-task End-to-end Training
Inference
Experiments
DataSets and Metrics
Implementation Details
Quantitative Comparison
...and 3 more sections

Figures (3)

Figure 1: Illustration of existing method (a) with two-stage pipeline and our single-stage 3DGCTR (b). Existing methods heavily depend on a detector’s output and also suffer from low reuse of task-agnostic modules. Therefore, we propose a transformer-based model that simply builds upon a mature 3DVG model, thus giving 3DVG model 3DDC capability. Compared to the SOTA method 3DJCG that jointly trains the two tasks, our method achieves a significant improvement.
Figure 2: The framework of 3DGCTR builds upon a mature DETR-like 3DVG model (bottom in the figure) with a caption head. After obtaining the fused visual tokens $V$ and decoder output query embeddings $Q$ of each scene, the caption head uses $Q$ as caption prefix to identify the described region, and contextual features $V$ surrounding the vote query to complement with more surrounding information for more descriptive caption generation. Finally, the referring/detection boxes are selected from the candidate boxes via the referring scores.
Figure 3: Qualitative Comparisons. We compare qualitative results with two state-of-the-art methods in 3DVG (left part in the figure) and 3DDC (right part in the figure) tasks, EDA and Vote2Cap-DETR. We mark correct attribute words in green and wrong descriptions in red. Our method produces right bounding boxes close to ground truth annotations and produces accurate descriptions of object attributes, classes and spatial relationships.
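
The Figure 2 caption notes that referring and detection boxes are selected from the candidate boxes via referring scores. The sketch below illustrates one plausible reading of that step: take the highest-scoring candidate for grounding, and keep every candidate above a cutoff for detection-style output. The function names, the `min_score` cutoff, and the list-based inputs are assumptions made for illustration; the selection rule actually used by 3DGCTR is not specified in this summary.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Sequence[float]


def select_referred_box(boxes: Sequence[Box], scores: Sequence[float]) -> Tuple[int, Box]:
    """Grounding: return the index and box of the candidate with the highest referring score."""
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and of equal length")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, boxes[best]


def select_detection_boxes(
    boxes: Sequence[Box], scores: Sequence[float], min_score: float
) -> List[Tuple[int, Box]]:
    """Detection-style output: keep every candidate whose score reaches min_score (hypothetical cutoff)."""
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    return [(i, box) for i, (box, score) in enumerate(zip(boxes, scores)) if score >= min_score]
```

In a dense-captioning setting, each box kept by `select_detection_boxes` would then be paired with the caption generated from its query embedding.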
