ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Chandan Yeshwanth; David Rozenberszki; Angela Dai

TL;DR

ExCap3D addresses the need for rich, multilevel descriptions of 3D scenes by jointly generating an object-level and a part-level caption for each detected object. The approach uses a two-head captioning architecture in which part-level details inform the object-level caption through cross-attention and shared hidden-state information, reinforced by semantic and textual consistency losses. The ExCap3D Dataset, built on ScanNet++ via multi-view VLM aggregation, provides 190k captions for 34k objects for training and evaluation. Experiments show that ExCap3D outperforms state-of-the-art 3D captioning baselines on CIDEr (and related metrics) for both object- and part-level descriptions, demonstrating the value of multilevel, consistent, information-sharing captioning in 3D scene understanding.
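The information flow from the part-level head to the object-level head can be illustrated with a single-query, single-head scaled dot-product attention. This is only a minimal sketch in plain Python, not the authors' implementation: the function names, the toy 2-dimensional vectors, and the choice of a single attention head are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Attend from one query vector over a set of key/value vectors.

    Returns the attention-weighted sum of `values` and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Hypothetical toy states: one object-head hidden state attends over
# three part-head hidden states (used here as both keys and values).
object_state = [1.0, 0.0]
part_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

context, weights = cross_attend(object_state, part_states, part_states)
```

In this toy setup the part state most aligned with the object state receives the largest weight, so the resulting context vector is dominated by it. A trained model would use learned projections for queries, keys, and values.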

Abstract

Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail, which often fails to capture fine-grained details such as the varying textures, materials, and shapes of an object's parts. We propose the task of expressive 3D captioning: given an input 3D scene, describe each object at multiple levels of detail, namely a high-level object description and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model that takes a 3D scan as input and, for each detected object in the scan, generates a fine-grained collective description of the object's parts along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a vision-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level details, respectively. Our code, dataset, and models will be made publicly available.
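The abstract mentions encouraging textual similarity in the latent space but does not give the formulation. One common way to express such an objective is a cosine-distance loss between pooled hidden states of the two captions. The sketch below assumes mean pooling and cosine distance; both choices, and all names in the code, are illustrative assumptions rather than the paper's actual loss.

```python
import math


def mean_pool(token_vectors):
    """Average a list of per-token vectors into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def latent_similarity_loss(object_tokens, part_tokens):
    """Assumed loss: 1 - cosine similarity of mean-pooled hidden states."""
    pooled_object = mean_pool(object_tokens)
    pooled_part = mean_pool(part_tokens)
    return 1.0 - cosine_similarity(pooled_object, pooled_part)


# Hypothetical hidden states for an object caption and a part caption.
aligned_loss = latent_similarity_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0]])
orthogonal_loss = latent_similarity_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Under this assumed formulation the loss approaches 0 when the two pooled representations point in the same direction and grows as they diverge, which would push the object-level and part-level captions toward consistent latent content.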
