Towards Foundation Models for 3D Vision: How Close Are We?

Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, Thomas L. Griffiths

TL;DR

This work assesses whether current 2D foundation models possess 3D understanding by introducing UniQA-3D, a unified, VQA-style benchmark for four core 3D tasks. It evaluates closed-source VLMs, specialized depth/pose/keypoint systems, and humans, revealing that VLMs struggle with 3D tasks while specialized models are accurate but fragile under geometric perturbations; humans remain the most robust. The study also shows that Transformer-based approaches align more closely with human 3D perception than CNNs, offering actionable insights for building robust 3D foundation models. The UniQA-3D benchmark enables fair cross-task comparisons and advances understanding of how to improve 3D vision systems for robotics and related applications.

Abstract

Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at https://github.com/princeton-vl/UniQA-3D .

Paper Structure

This paper contains 20 sections, 1 equation, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (a) We sample images from the KITTI dataset and flip them to create upside-down images. (b) Comparison of accuracy of different methods. MiDaS-DPT works the best in general, and both MiDaS models are slightly better than humans. All the VLMs perform poorly, with GPT4-Omni performing the best on regular inputs. (c) VLMs have multiple failure modes. See text for details.
  • Figure 2: We compare the similarity between humans and different models using different metrics, including (a) pair sampling strategy, (b) relative depth difference, (c) Cohen's $\kappa$, and (d) semantic labels. Best viewed zoomed-in and in color.
  • Figure 3: Results on the spatial reasoning task. (a) Our benchmark requires a strong spatial reasoning ability and is very challenging. (b) Even the specialized VQA model MDETR can only achieve 74.4% accuracy. (c) Model accuracy drops as the scene complexity grows (more objects). (d) Longer questions do not necessarily lead to worse performance. See text for detailed analysis.
  • Figure 4: Comparison between specialist neural networks, VLMs, and humans on relative camera pose classification. The bars are 95% confidence intervals.
  • Figure 5: Matching experiment results. Transformer-based LightGlue is more similar to human matching than the classical ORB matcher.
  • ...and 5 more figures
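
Figure 2 reports Cohen's $\kappa$ as one measure of human-model similarity. As a rough illustration of what that statistic captures, the sketch below computes $\kappa$ between two lists of categorical answers: the observed agreement rate, corrected for the agreement expected by chance from each rater's label frequencies. The function name `cohens_kappa`, the `"A"`/`"B"` answer labels, and the example answers are hypothetical and are not taken from the paper or its code; how the authors actually aggregate human responses is not stated in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long lists of categorical labels.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each answered independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters always give the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up answers to eight "which point is closer, A or B?" questions.
human_answers = ["A", "A", "B", "B", "A", "B", "A", "A"]
model_answers = ["A", "B", "B", "B", "A", "A", "A", "A"]

kappa = cohens_kappa(human_answers, model_answers)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because $\kappa$ discounts chance agreement, two systems that both reach high accuracy can still have a low $\kappa$ if their errors fall on different questions.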