Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao

TL;DR

This paper tackles the bottleneck of object-level referencing and grounding in 3D multimodal LLMs by introducing explicit object identifiers and object-centric scene representations. It decomposes scenes into object proposals, attaches learnable <OBJ_i> tokens, and maps per-object features from 3D and 2D encoders into a unified sequence fed to an LLM, enabling task-unified QA-style training with minimal fine-tuning. Across five ScanNet-based benchmarks, Chat-Scene achieves state-of-the-art results on 3D grounding, dense captioning, and VQA without task-specific heads, and ablations highlight the value of learnable identifiers and multi-view features. The work reduces token costs, improves grounding accuracy, and demonstrates adaptability to 2D video input, pointing to scalable 3D scene understanding with LLMs.

Abstract

Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
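
To make the object-centric scene representation concrete, the following is a minimal sketch of how a sequence of object-level embeddings might be assembled. It is not the authors' implementation. The helper names (`linear`, `build_scene_sequence`), the toy dimensions, the use of a single linear projection per modality, and the per-object ordering of identifier, 3D token, then 2D token are all assumptions made for illustration.

```python
import random


def linear(vec, weight):
    """Project a feature vector with a (len(vec) x d_out) weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def build_scene_sequence(feats_3d, feats_2d, id_embeds, w_3d, w_2d):
    """Return a flat token list: [id_0, 3d_0, 2d_0, id_1, 3d_1, 2d_1, ...].

    The interleaving order is an assumption of this sketch.
    """
    seq = []
    for obj_id, f3, f2 in zip(id_embeds, feats_3d, feats_2d):
        seq.append(obj_id)             # identifier embedding for this proposal
        seq.append(linear(f3, w_3d))   # 3D feature projected to the LLM width
        seq.append(linear(f2, w_2d))   # 2D feature projected to the LLM width
    return seq


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 proposals, 16-d 3D features, 32-d 2D features, 8-d LLM width.
rng = random.Random(0)
n_obj, d3, d2, d_llm = 4, 16, 32, 8
feats_3d = rand_matrix(n_obj, d3, rng)
feats_2d = rand_matrix(n_obj, d2, rng)
id_embeds = rand_matrix(n_obj, d_llm, rng)   # stand-in for learned identifier embeddings
w_3d = rand_matrix(d3, d_llm, rng)
w_2d = rand_matrix(d2, d_llm, rng)

scene_sequence = build_scene_sequence(feats_3d, feats_2d, id_embeds, w_3d, w_2d)
print(len(scene_sequence), len(scene_sequence[0]))
```

Under these assumptions the scene costs three tokens per object proposal, so the sequence length scales with the number of detected objects rather than with the number of points in the scene.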

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: An example of using object identifiers during the conversation.
  • Figure 2: Overall model architecture. The model processes a 3D scene's point cloud input by decomposing it into object proposals via a pre-trained detector. Subsequently, the 3D and 2D encoders are employed to extract object-centric representations. After projection layers, they are combined with object identifiers to form the scene embeddings as a sequence of object-level embeddings, which are then fed into the LLM. The assigned unique identifiers enable efficient object referencing in subsequent interactions.
  • Figure 3: Examples of various 3D scene-language understanding tasks. All the tasks are unified to single-turn question-answering pairs without extra task heads. Object identifiers are used to reference and ground the object during the conversation.
  • Figure 4: Visualization results of video grounding for video input. "GT" denotes the projected 2D masks derived from the ground-truth 3D point cloud mask.
  • Figure 5: Visualization results of 3D question answering on ScanQA.
  • ...and 2 more figures
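
The unification described in the Figure 3 caption, where grounding and captioning both become single-turn question-answer pairs that refer to objects by identifier, can be sketched as follows. This is an illustrative sketch only: the prompt wording, the zero-padded `<OBJ013>` token spelling, and the function names are hypothetical and are not taken from the paper.

```python
import re


def obj_token(index):
    """Render an object identifier token (zero-padded spelling is assumed)."""
    return f"<OBJ{index:03d}>"


def grounding_sample(description, target_index):
    """Grounding: the question describes an object, the answer is its identifier."""
    return {
        "question": f"Which object matches this description: {description}",
        "answer": obj_token(target_index),
    }


def caption_sample(object_index, caption):
    """Dense captioning: the question names an identifier, the answer is text."""
    return {
        "question": f"Describe {obj_token(object_index)}.",
        "answer": caption,
    }


def parse_identifiers(text):
    """Recover object indices from generated text, in order of appearance."""
    return [int(m) for m in re.findall(r"<OBJ(\d+)>", text)]


grounding = grounding_sample("the chair by the window", 13)
caption = caption_sample(5, "A wooden table in the middle of the room.")
print(grounding["answer"], parse_identifiers(grounding["answer"]))
print(caption["question"])
```

Because the answer to a grounding query is just text containing identifier tokens, the same parsing step also covers queries that refer to several objects or to none, without a dedicated grounding head.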