Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, C. -C. Jay Kuo

TL;DR

Descrip3D addresses the limited relational reasoning in 3D scene understanding by attaching object-level textual descriptions capturing intrinsic attributes and spatial relations. It integrates these descriptions at embedding and prompt levels, enabling unified reasoning across grounding, captioning, and QA without task-specific heads. Empirical results on five benchmarks show consistent improvements, particularly for multi-object and relational reasoning tasks, and ablations confirm the utility of dual-level integration. The approach demonstrates that lightweight linguistic descriptions can provide a scalable and interpretable bridge between vision and language in complex indoor environments.

Abstract

Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D augments each object with a textual description that captures both its intrinsic attributes and its contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across tasks such as grounding, captioning, and question answering, without task-specific heads or additional supervision. Evaluated on five benchmark datasets (ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap), Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes. Our code and data are publicly available at https://github.com/jintangxue/Descrip3D.
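
The dual-level integration described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SceneObject` type, the weighted-sum fusion in `fuse_embeddings`, the `weight` parameter, and the `<OBJ001>`-style identifier format in `build_prompt` are all hypothetical choices made for illustration, and the short float lists stand in for real encoder outputs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SceneObject:
    """One object proposal (hypothetical structure, for illustration only)."""
    obj_id: int
    visual_emb: List[float]  # stand-in for 2D/3D encoder features
    text_emb: List[float]    # stand-in for the embedded text description
    description: str         # attributes plus relations to nearby objects


def fuse_embeddings(obj: SceneObject, weight: float = 0.5) -> List[float]:
    """Embedding-level fusion, assumed here to be a weighted element-wise sum."""
    if len(obj.visual_emb) != len(obj.text_emb):
        raise ValueError("visual and text embeddings must have the same size")
    return [v + weight * t for v, t in zip(obj.visual_emb, obj.text_emb)]


def build_prompt(question: str, objects: Iterable[SceneObject],
                 queried_ids: Iterable[int]) -> str:
    """Prompt-level injection: prepend descriptions of the queried objects."""
    wanted = set(queried_ids)
    lines = [f"<OBJ{o.obj_id:03d}>: {o.description}"
             for o in objects if o.obj_id in wanted]
    if not lines:
        return f"Question: {question}"
    return "\n".join(lines) + f"\nQuestion: {question}"


if __name__ == "__main__":
    chair = SceneObject(1, [1.0, 2.0], [0.5, -1.0],
                        "A brown chair next to the desk.")
    print(fuse_embeddings(chair))
    print(build_prompt("What is next to the desk?", [chair], [1]))
```

In this sketch the fused vectors would serve as per-object tokens, while the prompt text carries the relational description only for the objects a query refers to, which keeps the prompt short for large scenes.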

Paper Structure

This paper contains 41 sections, 3 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: An example of injecting object-level text descriptions during conversation. Providing these descriptions significantly improves the model’s accuracy and reasoning performance.
  • Figure 2: Overall model architecture. We propose a novel and powerful method that explicitly models inter-object relationships by integrating relational text descriptions into object-centric scene representations via a dual-level strategy. From a 3D scan, we extract object proposals and encode their geometry and appearance using pretrained 2D and 3D encoders. Each object is enriched with a natural language description capturing both intrinsic attributes and spatial relations to nearby objects. These descriptions guide scene understanding through: (1) embedding-level fusion with visual features to enhance object representations, and (2) prompt-level injection of queried object descriptions to enhance object-specific relational reasoning. The resulting multimodal tokens enable high-level reasoning for 3D grounding, dense captioning, and question answering. Our design equips the model with both localized and contextual spatial semantics, significantly improving relational reasoning.
  • Figure 3: Qualitative comparison of 3D scene understanding tasks. Descrip3D outperforms Chat-Scene, especially in cases involving complex spatial grounding or multi-object reasoning, due to its dual-level integration of relational textual descriptions, which enhances contextual understanding.
  • Figure 4: Qualitative examples of object-level relational descriptions generated using Prompt A (Default) with LLaVA-1.5.
  • Figure 5: Qualitative examples of object-level relational descriptions generated using Prompt B (Spatially Focused) with LLaVA-1.5. Compared to Prompt A, these descriptions include more explicit spatial terms (e.g., “on the left,” “behind”) and visual attributes, resulting in shorter but more positionally grounded sentences.
  • ...and 3 more figures