ChatSplat: 3D Conversational Gaussian Splatting

Hanlin Chen, Fangyin Wei, Gim Hee Lee

TL;DR

ChatSplat addresses the challenge of enabling natural language interaction with complex 3D environments by embedding language into 3D Gaussians and interfacing with large language models. It introduces a hierarchical, patch-wise object-language embedding that decouples masks from feature maps, a view- and scene-level encoder to produce LLM-ready tokens, and a scene-specific autoencoder-style normalization to stabilize learning across diverse language embeddings. The method supports object-, view-, and scene-level chatting with real-time performance, outperforming CLIP/LangSplat-based baselines on open-ended 3D chat tasks while maintaining higher FPS. This work advances interactive 3D scene understanding for applications in robotics, AR/VR, and immersive querying by enabling fluid, language-guided exploration of 3D content.

Abstract

Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has gained growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space. Unlike existing methods that primarily use CLIP-derived language features focused solely on segmentation, ChatSplat facilitates interaction on three levels: objects, views, and the entire 3D scene. For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, unlike LangSplat's pixel-wise language embedding that implicitly includes mask and embedding. Here, we explicitly decouple the language embedding into separate mask and feature map representations, allowing more flexible object-level interaction. To address the challenge of learning 3D Gaussians posed by the complex and diverse distribution of language embeddings used in the LLM, we introduce a learnable normalization technique to standardize these embeddings, facilitating effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions -- object, view, and scene -- within 3D space, enhancing both understanding and engagement.
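
The abstract states only that the normalization is learnable and that it standardizes the language embeddings. As a minimal sketch of that idea, assuming a per-channel affine form (the class `AffineNormalizer`, the helper `init_from_statistics`, and the choice to initialise from the mean and standard deviation are all hypothetical and not taken from the paper), the 3D Gaussians could be fit to normalized targets that are mapped back to the LLM's embedding space before use:

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class AffineNormalizer:
    """Hypothetical per-channel normalizer: z = (x - shift) / scale."""
    shift: Vector  # assumed to be learned per scene; fixed here for illustration
    scale: Vector  # assumed to be learned per scene; must stay non-zero

    def normalize(self, x: Vector) -> Vector:
        # Map a raw language embedding to a standardized training target.
        return [(xi - b) / s for xi, b, s in zip(x, self.shift, self.scale)]

    def denormalize(self, z: Vector) -> Vector:
        # Map a rendered (normalized) feature back to the original space.
        return [zi * s + b for zi, b, s in zip(z, self.shift, self.scale)]


def init_from_statistics(embeddings: List[Vector], eps: float = 1e-6) -> AffineNormalizer:
    """Initialise shift and scale from per-channel mean and standard deviation."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    std = [(v + eps) ** 0.5 for v in var]  # eps keeps the scale non-zero
    return AffineNormalizer(shift=mean, scale=std)


# Two channels with very different ranges, as a stand-in for LLM embeddings.
embeddings = [[10.0, 0.01], [14.0, 0.03], [12.0, 0.02]]
normalizer = init_from_statistics(embeddings)
targets = [normalizer.normalize(e) for e in embeddings]
restored = [normalizer.denormalize(t) for t in targets]
```

After normalization both channels have zero mean and comparable spread, which is the property the abstract appeals to when it says standardized embeddings are easier for the Gaussians to learn. The inverse map is what keeps the tokens compatible with the LLM.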

Paper Structure

This paper contains 20 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: ChatSplat is the first 3D Gaussian Splatting-based approach that enables conversational interaction with a 3D environment across multiple levels, including view, object, and the entire scene. The key idea is to learn a 3D conversational field with 3D Gaussians whose renderings can be encoded into tokens that connect seamlessly to an LLM.
  • Figure 2: Overview of ChatSplat. Our framework generates hierarchical language feature maps to support view-, object-, and scene-level chatting. For object-level chatting, we first render a mask to isolate the feature map of the selected object. The proposed encoder then converts the feature map into the LLM's input through dimensionality lifting and tokenization.
  • Figure 3: Qualitative comparison on view-level chatting. Correct answers are highlighted with green boxes, while incorrect ones are marked in red. Our ChatSplat outperforms the baseline method, LLaVA-OV, primarily because LLaVA-OV relies solely on rendered images. These rendered images inherently contain errors, which are propagated to LLaVA-OV, leading to compounded inaccuracies in the final output. Specifically, as shown in the first row of Fig. 3, LLaVA-OV incorrectly outputs "visible scratches" due to rendering errors, with scratches erroneously appearing in the rendered image on the right.
  • Figure 4: Qualitative results on object-level chatting. For object-level chatting, the object is selected interactively using a mouse and highlighted in red for visual identification.
  • Figure 5: Qualitative results on scene-level chatting. For scene-level chatting, multi-view feature maps are processed by the decoder to generate tokens that collectively represent the language context of the entire scene. These tokens are then input into an LLM to enable conversational interaction at the scene level.
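
The pipeline described in the captions of Figures 2 and 5 (mask the rendered feature map, lift its dimensionality, tokenize it, and combine tokens from several views) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's encoder: the function names, the average pooling over square patches, and the single linear projection used for lifting are all hypothetical choices.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W x C rendered language features
Mask = List[List[float]]         # H x W, 1 for the selected object, 0 elsewhere


def apply_mask(features: FeatureMap, mask: Mask) -> FeatureMap:
    """Object level: keep features of the selected object, zero the rest."""
    return [[[m * c for c in feat] for feat, m in zip(frow, mrow)]
            for frow, mrow in zip(features, mask)]


def lift(feat: Vector, weight: List[Vector]) -> Vector:
    """Assumed linear lifting; weight has shape D_out x C."""
    return [sum(w * c for w, c in zip(row, feat)) for row in weight]


def tokenize(features: FeatureMap, weight: List[Vector], patch: int) -> List[Vector]:
    """View level: average each patch x patch block, then lift it to token width."""
    height, width = len(features), len(features[0])
    channels = len(features[0][0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            cells = [features[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            mean = [sum(cell[k] for cell in cells) / len(cells)
                    for k in range(channels)]
            tokens.append(lift(mean, weight))
    return tokens


def scene_tokens(views: List[FeatureMap], weight: List[Vector], patch: int) -> List[Vector]:
    """Scene level: concatenate the token sequences of several views."""
    return [tok for view in views for tok in tokenize(view, weight, patch)]


# Toy 2 x 2 feature map with C = 2, lifted to D_out = 3.
view = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
weight = [[1, 0], [0, 1], [1, 1]]
object_mask = [[1, 0], [0, 0]]

view_level = tokenize(view, weight, patch=2)
object_level = tokenize(apply_mask(view, object_mask), weight, patch=2)
scene_level = scene_tokens([view, view], weight, patch=2)
```

Keeping the mask separate from the feature map means the same rendered features serve all three levels: the object-level path differs from the view-level path only by the call to `apply_mask`, and the scene-level path only by concatenation.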