
Unified Scene Representation and Reconstruction for 3D Large Language Models

Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

TL;DR

This work tackles the challenge of enabling LLMs to understand and interact with 3D scenes by proposing Uni3DR$^2$, a unified 3D representation and reconstruction framework that leverages frozen 2D foundation models (SAM and CLIP) and a multi-scale 3D decoder with GRU fusion to produce geometry and semantics directly from image sequences. A lightweight reconstruction module and a bridging Uni3DR$^2$-LLM pathway allow high-fidelity 3D features and TSDF-based geometry to feed LLMs via QFormer, achieving improved 3D reconstruction (e.g., $\text{F-Score}=0.580$ on ScanNet) and state-of-the-art 3D vision–language performance on ScanQA and 3DMV-VQA, even surpassing methods that rely on GT point clouds. Key contributions include the unified geometric-semantic 3D representation, a dual frozen-encoder pipeline, a GRU-based 3D decoder for point-to-point connectivity, and in-LLM fusion that enhances 3D V&L understanding with measurable gains in BLEU-1 and overall accuracy. This approach provides a robust pathway to integrate LLMs with 3D environments for robotics and embodied AI tasks, and lays groundwork for scalable, semantically rich 3D scene understanding.
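To make the reported reconstruction metric concrete, the following is a minimal sketch of how an F-Score between a predicted and a ground-truth point set is commonly computed. The function name `f_score`, the 5 cm threshold, and the toy point sets are assumptions for illustration; the exact evaluation protocol behind the ScanNet number above is not given in this summary.

```python
import numpy as np

def f_score(pred, gt, threshold=0.05):
    """Precision, recall and F-score between point sets of shape (N, 3) and (M, 3).

    `threshold` (in metres) is an assumed value, not taken from the paper.
    """
    # Brute-force pairwise distances; fine for small clouds, use a KD-tree for large ones.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = float(np.mean(d.min(axis=1) < threshold))  # predicted points close to GT
    recall = float(np.mean(d.min(axis=0) < threshold))     # GT points close to the prediction
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example (hypothetical data): two of three predicted points lie near the GT.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pred = np.array([[0.01, 0.0, 0.0], [1.0, 0.02, 0.0], [0.5, 0.5, 0.5]])
precision, recall, f = f_score(pred, gt)
print(precision, recall, f)
```

Precision penalizes spurious predicted geometry and recall penalizes missing geometry, so the harmonic mean rewards reconstructions that are both accurate and complete.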

Abstract

Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to the point clouds, which serve as inputs for LLMs. However, this solution establishes no 3D point-to-point connections, so spatial structure information is lacking. Moreover, the geometric and semantic representations of the scene are neither integrated nor unified, which limits 3D scene understanding. In this paper, we demonstrate the importance of a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR$^2$, which extracts 3D geometry- and semantics-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that Uni3DR$^2$ yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8%). When applied to LLMs, Uni3DR$^2$-LLM outperforms the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
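The feature-lifting step that the abstract attributes to existing approaches can be sketched as follows: each 3D point is projected into an image with a pinhole camera model and takes the 2D feature at the pixel it lands on. This is only an illustrative sketch; `lift_features`, the intrinsics `K`, the pose `world_to_cam`, and the toy feature map are hypothetical and are not taken from the paper or from any CLIP implementation.

```python
import numpy as np

def lift_features(points_world, feat_map, K, world_to_cam):
    """Sample an (H, W, C) feature map at the projections of (N, 3) world points."""
    H, W, C = feat_map.shape
    # Homogeneous world coordinates -> camera coordinates.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    # Perspective projection with intrinsics K (nearest-pixel sampling).
    uvw = (K @ pts_cam.T).T
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((len(points_world), C), dtype=feat_map.dtype)
    feats[valid] = feat_map[v[valid], u[valid]]
    return feats, valid

# Toy setup (hypothetical): a 4x4 feature map whose feature at pixel (u, v) is (u, v).
uu, vv = np.meshgrid(np.arange(4), np.arange(4))
feat_map = np.stack([uu, vv], axis=-1).astype(float)
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
world_to_cam = np.eye(4)
points = np.array([[0.0, 0.0, 1.0], [0.5, -0.5, 1.0], [0.0, 0.0, -1.0], [5.0, 0.0, 1.0]])
feats, valid = lift_features(points, feat_map, K, world_to_cam)
print(feats, valid)
```

Each point is handled independently here, so nothing in the lifted features relates one 3D point to its neighbours, which is the missing point-to-point structure the abstract refers to.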

Paper Structure

This paper contains 14 sections, 12 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Comparison of representations for 3D LLMs. (a) The previous baseline (Hong et al., 2023) is isolated and complex: it requires extra NeRF or SLAM models, or depth, to extract the point cloud, and then lifts features to the 3D representation. (b) By contrast, our Uni3DR$^2$ is unified and neat. We jointly learn a geometrically and semantically rich volumetric representation together with high-quality reconstruction as LLM inputs. Our learned representation and reconstruction significantly enhance the LLM's performance in 3D environments.
  • Figure 2: Overview of our Uni3DR$^2$-LLM framework. Given video inputs, Uni3DR$^2$-LLM employs (1) a frozen encoder integrating the SAM (Kirillov et al., 2023) and CLIP (Radford et al., 2021) image encoders, followed by a decoder for 3D representations, (2) a lightweight module focused on 3D reconstruction, and (3) an LLM integrated with QFormer (Li et al., 2023) for 3D vision-language understanding. A marker in the figure denotes frozen modules.
  • Figure 3: 3D reconstruction visualization results on ScanNet. Compared to the baseline method (Sun et al., 2021; second column), our method (third column) predicts reconstructions with more semantic details. (Zoom in for details.)
  • Figure 4: 3D vision-language understanding visualization results. Our Uni3DR$^2$-LLM predicts accurate reconstructions and answers the user's input questions. (Zoom in for details.)
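As a rough illustration of the recurrent fusion that the 3D decoder in Figure 2 is described as using, the sketch below applies a standard GRU update to per-voxel features as new video fragments arrive. It is a simplification under stated assumptions: the class name `VoxelGRU` is hypothetical, per-voxel linear maps stand in for whatever 3D convolutions the actual decoder uses, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VoxelGRU:
    """Per-voxel GRU cell (hypothetical stand-in for a convolutional 3D GRU)."""

    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        shape = (2 * channels, channels)
        self.Wz = rng.normal(0.0, 0.1, shape)  # update-gate weights
        self.Wr = rng.normal(0.0, 0.1, shape)  # reset-gate weights
        self.Wh = rng.normal(0.0, 0.1, shape)  # candidate-state weights

    def step(self, h, x):
        """Fuse new features x (V, C) into the running voxel state h (V, C)."""
        hx = np.concatenate([h, x], axis=-1)
        z = sigmoid(hx @ self.Wz)  # how much of the state to overwrite
        r = sigmoid(hx @ self.Wr)  # how much of the old state feeds the candidate
        h_tilde = np.tanh(np.concatenate([r * h, x], axis=-1) @ self.Wh)
        return (1.0 - z) * h + z * h_tilde

# Toy run (hypothetical data): 3 fragments, 5 voxels, 4 feature channels.
rng = np.random.default_rng(1)
fragments = rng.normal(size=(3, 5, 4))
gru = VoxelGRU(channels=4)
h = np.zeros((5, 4))
for x in fragments:
    h = gru.step(h, x)
print(h.shape)
```

The update gate lets each voxel decide how strongly a new observation should overwrite what earlier fragments contributed, so the fused state stays bounded while accumulating evidence over the sequence.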