Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang

TL;DR

Lexicon3D introduces a unified probing framework to systematically evaluate image-, video-, and 3D-based vision foundation models on complex 3D scene understanding across four tasks: vision-language reasoning, visual grounding, semantic segmentation, and registration. By freezing encoders and training shallow heads, it projects multi-view features into a shared 3D representation and analyzes performance across seven VFMs. Key findings include DINOv2 as a strong general backbone, video models excelling in object-level and geometric tasks, diffusion models enhancing geometric registration, and language-pretrained encoders not always improving language-guided tasks, with MoVE fusion providing robust gains. The work highlights the importance of flexible encoder selection and feature fusion to advance scalable, multimodal 3D scene understanding.
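
The projection of multi-view features into a shared 3D representation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `lift_features_to_points`, the single shared pinhole intrinsics matrix, the nearest-pixel lookup, and the plain averaging across views are all assumptions made for illustration, and the sketch omits occlusion (depth) checks entirely.

```python
import numpy as np

def lift_features_to_points(points, feats, intrinsics, world_to_cam):
    """Average per-pixel 2D features over every view in which a 3D point projects inside the image.

    Hypothetical helper for illustration only.
    points:       (N, 3) world-space coordinates
    feats:        (V, H, W, C) per-view feature maps from a frozen encoder
    intrinsics:   (3, 3) pinhole camera matrix, assumed shared by all views
    world_to_cam: (V, 4, 4) extrinsic matrices
    """
    V, H, W, C = feats.shape
    N = points.shape[0]
    accum = np.zeros((N, C), dtype=np.float64)
    counts = np.zeros(N, dtype=np.int64)
    homog = np.concatenate([points, np.ones((N, 1))], axis=1)  # (N, 4)

    for v in range(V):
        # Transform points into the camera frame of view v.
        cam = (world_to_cam[v] @ homog.T).T[:, :3]
        z = cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth

        # Pinhole projection followed by a nearest-pixel lookup.
        pix = (intrinsics @ cam.T).T
        col = np.round(pix[:, 0] / safe_z).astype(np.int64)
        row = np.round(pix[:, 1] / safe_z).astype(np.int64)

        # No occlusion test here: a point counts as visible if it lands inside the image.
        visible = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
        accum[visible] += feats[v, row[visible], col[visible]]
        counts[visible] += 1

    # Points never observed keep an all-zero feature.
    seen = counts > 0
    accum[seen] /= counts[seen, None]
    return accum, counts
```

Under these assumptions, the returned per-point features could be fed to a shallow task head while the encoder that produced `feats` stays frozen; the `counts` array indicates how many views contributed to each point.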

Abstract

Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D

Paper Structure (25 sections, 2 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Evaluation settings and major results of different vision foundation models (VFMs) for complex 3D scene understanding. We assess the performance of VFMs on multimodal scene reasoning, grounding, segmentation, and registration tasks.
  • Figure 2: Our unified probing framework to evaluate visual foundation models on various tasks.
  • Figure 3: Visualization of extracted scene features from different visual foundation models. We use principal component analysis (PCA) to compress the feature embeddings into three dimensions. The clear distinction between colors and patterns demonstrates the behaviors of different models.
  • Figure 4: Evaluation curves on the ScanQA benchmark. The $x$-axis shows the number of training epochs. DINOv2 exhibits clearly superior performance.
  • Figure 5: Visualization of 3D semantic segmentation on ScanNet. Image encoders obtain better performance.
  • ...and 5 more figures
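
The PCA step mentioned in the Figure 3 caption, which compresses feature embeddings into three dimensions, can be sketched as follows. This is only an illustration under stated assumptions: the function name `pca_to_rgb` is hypothetical, the per-channel min-max rescaling to [0, 1] for display as RGB is an assumed convention the caption does not specify, and the input is assumed to have at least three points and at least three feature channels.

```python
import numpy as np

def pca_to_rgb(features):
    """Project (N, C) features onto their top-3 principal components and rescale each to [0, 1].

    Hypothetical helper for illustration; assumes N >= 3 and C >= 3.
    """
    # PCA operates on mean-centered data.
    centered = features - features.mean(axis=0, keepdims=True)

    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # (N, 3)

    # Assumed display convention: min-max normalize each component independently.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)
```

Each row of the result can then be used as the color of the corresponding 3D point, so that points with similar features receive similar colors.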