ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Chandan Yeshwanth, David Rozenberszki, Angela Dai

TL;DR

ExCap3D addresses the need for rich, multilevel descriptions of 3D scenes by jointly generating object-level and part-level captions for each detected object. The approach uses a two-head captioning architecture where part-level details inform object-level captions through cross-attention and shared hidden-state information, reinforced by semantic and textual consistency losses. A large ExCap3D Dataset, built on ScanNet++, provides 190k captions for 34k objects via multi-view VLM aggregation, enabling robust training and evaluation. Experiments show ExCap3D outperforms state-of-the-art 3D captioning baselines on CIDEr (and related metrics) for both object- and part-level descriptions, demonstrating the value of multilevel, consistent, and information-sharing captioning in 3D scene understanding.

Abstract

Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail, which often fails to capture fine-grained properties such as the varying textures, materials, and shapes of an object's parts. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail, namely a high-level object description and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model that takes a 3D scan as input and, for each detected object in the scan, generates a fine-grained collective description of the object's parts, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a vision-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level captions, respectively. Our code, dataset and models will be made publicly available.
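The abstract mentions encouraging textual similarity in the latent space but does not give the formulation. As a rough illustration only, one common way to express such a term is a cosine-distance loss between two caption embeddings. The function names, the choice of cosine distance, and which pair of embeddings is compared are all assumptions made for this sketch, not details taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Direction is undefined for a zero vector; treat as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def textual_similarity_loss(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Hypothetical loss: 0 when the embeddings align, up to 2 when opposite."""
    return 1.0 - cosine_similarity(emb_a, emb_b)
```

Under this assumed form, minimizing the loss pulls the two caption embeddings toward the same direction in latent space, regardless of their magnitudes.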

Paper Structure

This paper contains 46 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: ExCap3D produces descriptions at multiple levels of detail for objects in an input 3D scan. For each detected object, we generate both object-level and part-level captions. The generated captions are consistent across the two levels and contain rich details.
  • Figure 2: Descriptions of objects in existing datasets such as Scan2Cap and ScanQA largely contain relations between objects, and limited local details at a single level. In contrast, the ExCap3D Dataset describes both the object as a whole and as a sum of its parts.
  • Figure 3: Overview of our ExCap3D captioning method. We predict 3D instances in the input 3D scene using Mask3D [schult2023mask3d], then predict object-level and part-level captions using two separate captioning heads. The object-level captions are further constrained by the part-level captioner's hidden states. Semantic and textual consistency losses are applied to ensure the overall consistency of both predicted captions.
  • Figure 4: Generation of the ExCap3D Dataset. We take the ground truth 3D semantics in ScanNet++, project them onto multi-view DSLR images, and obtain descriptions of the image crops $c_{i,j}$ using a VLM. For part crops, we use pseudo-ground truth from MaskClustering. Finally, we summarize the captions from different views using an LLM.
  • Figure 5: Qualitative evaluation on ScanNet++ [yeshwanthliu2023scannetpp] scenes, in comparison with D3Net [chen2022d3net], Vote2CAP-DETR [chen2023vote2cap], and PQ3D [zhu2024unifyingpq3d]. Predicted text from each method is color-coded to indicate correct and incorrect phrases relative to the ground truth. Our method produces more consistent and detailed captions at both the object and part levels of detail. * indicates truncated output where the underlined phrase was predicted repeatedly.
  • ...and 1 more figure
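
The dataset generation flow described in the Figure 4 caption (caption each view crop with a VLM, then summarize across views with an LLM) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `aggregate_multiview_captions`, `caption_crop`, and `summarize` are hypothetical names, the VLM and LLM are passed in as plain callables rather than real model calls, and the projection and cropping steps are assumed to have already produced the crops.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Crop = TypeVar("Crop")  # any image-crop representation (assumption)


def aggregate_multiview_captions(
    crops_per_object: Dict[str, Sequence[Crop]],
    caption_crop: Callable[[Crop], str],      # stand-in for the VLM (hypothetical)
    summarize: Callable[[List[str]], str],    # stand-in for the LLM (hypothetical)
) -> Dict[str, str]:
    """Return one summarized caption per object from its per-view crops."""
    captions: Dict[str, str] = {}
    for object_id, crops in crops_per_object.items():
        # Caption every view crop independently.
        view_captions = [caption_crop(crop) for crop in crops]
        # Drop empty captions so they do not dilute the summary.
        view_captions = [text for text in view_captions if text.strip()]
        if view_captions:
            # Merge the per-view captions into a single description.
            captions[object_id] = summarize(view_captions)
    return captions
```

Passing the models in as callables keeps the sketch independent of any particular VLM or LLM API; objects with no usable view captions are simply omitted from the result.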