TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Bu Jin, Yupeng Zheng, Pengfei Li, Weize Li, Yuhang Zheng, Sujie Hu, Xinyu Liu, Jinwei Zhu, Zhijie Yan, Haiyang Sun, Kun Zhan, Peng Jia, Xiaoxiao Long, Yilun Chen, Hao Zhao

TL;DR

This work defines outdoor 3D dense captioning as a new task: detecting and describing every 3D object in a scene from LiDAR and panoramic RGB inputs. It introduces the TOD$^3$Cap network, which fuses LiDAR and multi-view images in BEV, uses a Relation Q-Former to capture context, and prompts a frozen LLM through a LLaMA-Adapter to generate object-centric captions, so the language model does not need to be retrained. It also presents the million-scale TOD$^3$Cap dataset, containing 2.3M captions for about 64.3K outdoor instances across 850 scenes, built with a semi-automatic annotation pipeline and human verification. Empirical results show TOD$^3$Cap outperforming adapted indoor baselines across multi-modal inputs, with CIDEr gains at both $0.25$ and $0.5$ IoU, and ablations examine the contribution of the Relation Q-Former, the language decoder, and the training strategy. Code, data, and models are released to support outdoor 3D vision-language research.

Abstract

3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes. To this end, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes. Notably, our TOD3Cap network can effectively localize and caption 3D objects in outdoor scenes, outperforming baseline methods by a significant margin (+9.6 CIDEr@0.5IoU). Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap.
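
The abstract reports its gain as CIDEr@0.5IoU without defining the metric. A common convention in 3D dense captioning, assumed here rather than taken from this page, is to average a per-object caption score over all ground-truth objects while zeroing the score of any object whose matched predicted box falls below the IoU threshold. The sketch below illustrates that convention only; the function name `caption_metric_at_iou` and its inputs are hypothetical, and the per-object scores are taken as given rather than computed by a real CIDEr implementation.

```python
from typing import Sequence


def caption_metric_at_iou(
    scores: Sequence[float],
    ious: Sequence[float],
    threshold: float,
) -> float:
    """Average per-object caption scores, counting only well-localized objects.

    scores[i]: caption score (e.g. CIDEr) for ground-truth object i (assumed given).
    ious[i]:   IoU between ground-truth box i and its matched predicted box.
    Objects with IoU below `threshold` contribute a score of zero.
    """
    if len(scores) != len(ious):
        raise ValueError("scores and ious must have the same length")
    if not scores:
        return 0.0
    kept = [s if iou >= threshold else 0.0 for s, iou in zip(scores, ious)]
    # Divide by the total number of objects, not just the ones that passed.
    return sum(kept) / len(scores)


# Hypothetical values for three ground-truth objects.
example_scores = [1.0, 0.5, 0.8]
example_ious = [0.6, 0.3, 0.45]
at_050 = caption_metric_at_iou(example_scores, example_ious, 0.5)
at_025 = caption_metric_at_iou(example_scores, example_ious, 0.25)
```

Under this convention a stricter IoU threshold can only lower the score, because poorly localized objects are counted as caption failures instead of being dropped from the average.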

Paper Structure

This paper contains 52 sections, 11 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: We introduce the task of 3D dense captioning in outdoor scenes (right). Given point clouds (right middle) and multi-view RGB inputs (right top), we predict box-caption pairs of all objects in a 3D outdoor scene. There are several fundamental domain gaps (middle column) between indoor and outdoor scenes, including Status, Point Cloud, Perspective, and Scene Area, bringing new challenges specific to outdoor scenes. Meanwhile, our outdoor 3D dense captioning (right bottom) contains more comprehensive concepts than indoor scenes (left bottom).
  • Figure 2: The statistical properties of the $\text{TOD}^3$Cap dataset. (a) The word cloud visualization of $\text{TOD}^3$Cap. (b) The visualization of object-environment relationships. (c) Statistics (in percentage) of the top 70 most frequent words. (d) Sentence length distribution.
  • Figure 3: Architecture of our proposed $\text{TOD}^3$Cap network. Firstly, BEV features are extracted from 3D LiDAR point cloud and 2D multi-view images, followed by a query-based detection head that generates a set of 3D object proposals from the BEV features. Secondly, to capture the relationship information, we utilize a Relation Q-Former where the objects interact with other objects and the surrounding environment to get the context-aware features. Finally, with an Adapter, the features are processed to be prompts for the language model to generate dense captions. This formulation does not require a re-training process of the language model.
  • Figure 4: Qualitative results for our proposed $\text{TOD}^3$Cap network. In the top left, we show our predicted bounding boxes and corresponding captions in the first row and ground truth in the second row. In the top right, we show our predicted bounding boxes in blue and the ground truth bounding boxes in red. In the bottom, we mark the wrong descriptions in red. The $\text{TOD}^3$Cap network produces impressive results except for a few mistakes.
  • Figure 5: $\text{TOD}^3$Cap: Towards 3D Dense Captioning in outdoor scenes.
  • ...and 14 more figures