Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao

Abstract

Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features of Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both the category and spatial levels during multi-step inference. Without additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, we demonstrate its applicability to real-world scenarios using only 2D inputs, avoiding computationally expensive 3D reconstruction.

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Chat-Scene++ surpasses previous state-of-the-art methods on five 3D scene understanding benchmarks.
  • Figure 2: Example of utilizing object identifiers in conversation.
  • Figure 3: Overall model architecture of Chat-Scene++. The model structures a 3D scene as a context-rich object sequence, forming the scene embeddings for the LLM input. Specifically, it decomposes the 3D scene into a sequence of object representations paired with object ID tokens. Context-rich object features are extracted using large-scale pre-trained 3D and 2D models. These features are then mapped to the LLM's embedding space, where scene embeddings are constructed by sequentially combining object IDs with their corresponding object-level embeddings. By leveraging flexible object IDs, Grounded CoT can be optionally enabled to enhance reasoning over object relationships.
  • Figure 4: Illustration of multi-modal context-rich feature extraction. Chat-Scene [chatscene] extracts object-centric features for separate object proposals, while Chat-Scene++ extracts context-rich object features using 3D scene-level encoders and 2D image-level encoders (a minimal fusion sketch follows this list).
  • Figure 5: Examples of various 3D scene-language understanding tasks. All the tasks are unified to single-turn question-answering pairs without extra task heads. Object identifiers are used to reference and ground the object during the conversation.
  • ...and 1 more figure