A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes

Ting Yu; Xiaojun Lin; Shuhui Wang; Weiguo Sheng; Qingming Huang; Jun Yu

TL;DR

This survey provides a comprehensive review of 3D dense captioning, covering task definition, architecture classification, dataset analysis, evaluation metrics, and in-depth discussion, and proposes a series of promising future directions for the field.

Abstract

Three-dimensional (3D) dense captioning is an emerging vision-language bridging task that aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents significant potential and challenges because it represents the real world more closely than 2D visual captioning and because 3D point cloud sources are complex to collect and process. Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field, which hinders its progress. In this paper, we provide a comprehensive review of 3D dense captioning, covering task definition, architecture classification, dataset analysis, evaluation metrics, and in-depth discussion. Based on a synthesis of previous literature, we refine a standard pipeline that serves as a common paradigm for existing methods. We also introduce a clear taxonomy of existing models, summarize the technologies involved in different modules, and conduct a detailed experimental analysis. Rather than introducing methods in chronological order, we group them into classes to facilitate exploration and analysis of the differences and connections among existing techniques. We also provide a reading guideline to help readers with different backgrounds and purposes read efficiently. Furthermore, we propose a series of promising future directions for 3D dense captioning by identifying challenges and aligning them with the development of related tasks, offering valuable insights and inspiring future research in this field. Our aim is to provide a comprehensive understanding of 3D dense captioning, foster further investigations, and contribute to the development of novel applications in multimedia and related domains.

Paper Structure (36 sections, 9 equations, 10 figures, 6 tables)

Introduction
Reading Guidelines
Related Work
Image Captioning
Dense Image Captioning
Dense Video Captioning
3D Visual Grounding
Architecture
Task Definition
Main Framework
Scene Encoder
Relation Module
Feature Decoder
Model Classification
Research Focus
...and 21 more sections

Figures (10)

Figure 1: Illustration of a 3D dense captioning task: localizing and describing objects in 3D scenes. The task involves the combined process of object localization and captioning to generate natural language descriptions for objects in a 3D scene. It takes a 3D point cloud source as input and produces a diverse range of bounding boxes along with multiple descriptions.
Figure 2: Visualization of a 3D point cloud and a 2D image. The single viewpoint inherent to 2D images inevitably leads to misaligned, distorted, and occluded appearances. Compared to 2D images with uniform pixels and grids, 3D point clouds are represented by sparse and unordered points, providing comprehensive geometric information, including physical characteristics such as size and shape and rich spatial relationships from multiple rotatable perspectives.
Figure 3: Evolution of 3D dense captioning: datasets and models over time. Two fundamental datasets, ScanRefer [chen2020scanrefer] and Nr3D [achlioptas2020referit3d], have played a pivotal role in shaping the field. The ScanRefer dataset was initially curated for the 3D visual grounding task, and Nr3D is a sub-dataset of ReferIt3D [achlioptas2020referit3d], comprising human-annotated 3D scenes. In subsequent years, the field witnessed a surge of novel models, including Scan2Cap [chen2021scan2cap], D3Net [takahashi2020d3net], X-Trans2Cap [yuan2022x], SpaCap3D [wang2022spatiality], MORE [jiao2022more], 3DJCG [cai20223djcg], CM3D [zhong2022contextual], and Vote2Cap-DETR [chen2023end]. Notably, the models are ordered by their commit dates on the benchmark, as the exact appearance times of the papers were ambiguous.
Figure 4: Illustration of representative visual-language bridging tasks related to 3D dense captioning. Both image captioning and dense image captioning involve using a 2D image as input to generate natural language descriptions. However, they differ in their objectives. Image captioning aims to generate a sentence that describes the overall content of the input image. On the other hand, dense captioning, as a variation of image captioning, focuses on generating distinctive descriptions for prominent regions within an image, capturing fine-grained details. 3D visual grounding is a task that emphasizes localizing the object described by an input sentence within a 3D scene. It involves linking the language-based description to the corresponding 3D object in the scene, bridging the gap between visual perception and natural language understanding. Dense video captioning involves localizing significant events in an untrimmed video and generating captions for each event. Among the four tasks mentioned, dense image captioning exhibits the closest resemblance to 3D dense captioning. Conversely, 3D visual grounding stands in contrast to 3D dense captioning, emphasizing different aspects of visual understanding and language integration.
Figure 5: Flowchart of the general framework for 3D dense captioning, which typically encompasses three main components: a scene encoder, a relation module, and a feature decoder. The relation module is a core component employed in most existing works, making it an integral part of the encoder-decoder structure. Specifically, the input to the framework is a 3D point cloud containing $N$ randomly sampled points, each characterized by $F$-dimensional features. The model then generates $I$ pairs of bounding boxes and descriptions $(B_i, D_i)$ as outputs. In the intermediate process, most models use a scene encoder to generate $M$ object proposals with $F'$-dimensional feature representations. These proposals are then used to learn the relationships between objects in the relation module. Finally, the feature decoder generates corresponding descriptions for the objects. Notably, a minority of methods do not follow this approach and instead perform detection and captioning simultaneously in the final feature encoding step.
...and 5 more figures
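
The three-stage framework described in the Figure 5 caption can be illustrated with a minimal sketch of its data flow: $N$ points with $F$-dimensional features go in, $M$ proposals with feature vectors are formed, and (box, description) pairs come out. Everything in the block below is a hypothetical toy stand-in and not taken from any surveyed method: the function names, the chunk-based "detector", the mean-context "relation module", and the template "decoder" are assumptions chosen only to make the shapes concrete. It also assumes the first three feature dimensions of each point are its xyz coordinates and that $F' = F$.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]
# Axis-aligned box as (xmin, ymin, zmin, xmax, ymax, zmax); an assumed encoding.
Box = Tuple[float, ...]


@dataclass
class Proposal:
    box: Box
    feature: Vector  # F'-dimensional proposal feature (here F' == F)


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def scene_encoder(points: List[Vector], num_proposals: int) -> List[Proposal]:
    """Toy stand-in for a detection backbone: one proposal per contiguous chunk.

    Leftover points that do not fill a whole chunk are ignored.
    """
    chunk = len(points) // num_proposals
    proposals = []
    for m in range(num_proposals):
        group = points[m * chunk:(m + 1) * chunk]
        xyz = [p[:3] for p in group]  # assumes xyz are the first three features
        mins = [min(axis) for axis in zip(*xyz)]
        maxs = [max(axis) for axis in zip(*xyz)]
        proposals.append(Proposal(tuple(mins + maxs), mean_vector(group)))
    return proposals


def relation_module(proposals: List[Proposal]) -> List[Proposal]:
    """Toy stand-in for relation learning: add a shared scene-context vector."""
    context = mean_vector([p.feature for p in proposals])
    return [
        Proposal(p.box, [f + c for f, c in zip(p.feature, context)])
        for p in proposals
    ]


def feature_decoder(proposal: Proposal) -> str:
    """Toy stand-in for a caption decoder: a template built from the box center."""
    b = proposal.box
    center = [(b[i] + b[i + 3]) / 2 for i in range(3)]
    return "an object centered at ({:.1f}, {:.1f}, {:.1f})".format(*center)


def dense_caption(points: List[Vector], num_proposals: int) -> List[Tuple[Box, str]]:
    """Run encoder -> relation module -> decoder and return (B_i, D_i) pairs."""
    proposals = relation_module(scene_encoder(points, num_proposals))
    return [(p.box, feature_decoder(p)) for p in proposals]


if __name__ == "__main__":
    # N = 4 points, F = 4 features each (xyz plus one extra channel), M = 2.
    demo_points = [
        [0.0, 0.0, 0.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
        [4.0, 4.0, 4.0, 0.0],
        [6.0, 6.0, 6.0, 0.0],
    ]
    for box, description in dense_caption(demo_points, num_proposals=2):
        print(box, description)
```

In a real system each stand-in would be replaced by a learned component, and the number of output pairs $I$ would typically be smaller than $M$ after low-confidence proposals are filtered out; the sketch keeps $I = M$ for simplicity.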
