DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

TL;DR

This work introduces DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding, and develops a comprehensive depth question-answering benchmark, built on existing depth image datasets, that rigorously assesses understanding in typical depth map scenarios.

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret the depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach exploits the unique characteristics of depth images, which are single-channel grayscale images whose pixel values directly reflect depth cues, to improve spatial reasoning. To address the challenges of limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate the corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate our model, we develop a comprehensive depth question-answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
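
To make the "simple channel replication" baseline mentioned in the abstract concrete, the following is a minimal sketch of what it usually means: a single-channel depth map is min-max normalized and then copied into three identical planes so that it fits an RGB encoder. This illustrates the baseline the abstract calls inadequate, not DeepSight's own encoder. The function names, the min-max normalization, and the toy depth values are assumptions made for illustration.

```python
from typing import List

# A depth map as H x W nested lists, one depth value per pixel (assumed layout).
DepthMap = List[List[float]]


def normalize_depth(depth: DepthMap) -> DepthMap:
    """Min-max scale depth values to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in depth]
    return [[(v - lo) / span for v in row] for row in depth]


def replicate_channels(depth: DepthMap, channels: int = 3) -> List[DepthMap]:
    """Naive baseline: copy the single depth plane into `channels` identical planes."""
    return [[row[:] for row in depth] for _ in range(channels)]


# Toy 2 x 2 depth map (hypothetical values, e.g. metres).
depth = [[0.5, 1.0], [2.0, 4.5]]
norm = normalize_depth(depth)
rgb_like = replicate_channels(norm)  # shape: 3 x 2 x 2
```

All three planes in `rgb_like` carry the same values, so replication changes the input shape without adding information. This is one way to read the abstract's motivation for a depth-specific encoder.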

Paper Structure (19 sections, 1 equation, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Stereoscopic Vision Ability Test for MLLMs. We input RGB images into the MLLMs and ask questions about distance comparisons between objects. The results show that the MLLMs have issues with their stereoscopic vision ability.
  • Figure 2: Benchmark construction. (a) defines the types of tasks. (b) describes how the benchmark is constructed from templates.
  • Figure 3: Sample from the Depth Instruction Dataset. The left side shows a sample of caption data used for alignment, and the right side shows an instruction sample generated by GPT-4.
  • Figure 4: Overall model architecture. The left figure shows the Depth Encoder architecture, where DeepSight modifies the CLIP image encoder to take an additional Bbox channel along with depth convolution. The right figure illustrates the model pipeline, showing the data flow from the input question and depth image to the output answer. In the alignment stage, only the Alignment Layer is trained while keeping the Depth Encoder, Text Encoder, and LLM frozen. In the fine-tuning stage, the other modules remain frozen and additional training is performed on the LLM.
  • Figure 5: Sample ratio search experiment. We search over the sample ratio in steps of 0.05. The test metric is zero-shot Scene Classification top-1 accuracy.
  • ...and 1 more figure
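
The two-stage training schedule described in the Figure 4 caption can be sketched as a small lookup of which modules receive gradient updates in each stage. This is only an illustrative reading of the caption: the module and stage names are hypothetical identifiers, and treating the Alignment Layer as frozen during fine-tuning is an assumption based on the phrase "the other modules remain frozen".

```python
from typing import Dict

# Module names are hypothetical labels for the components named in Figure 4.
MODULES = ("depth_encoder", "text_encoder", "alignment_layer", "llm")

# Modules that are trained in each stage, following the Figure 4 caption.
# Assumption: everything except the LLM stays frozen during fine-tuning.
TRAINABLE_BY_STAGE = {
    "alignment": {"alignment_layer"},
    "fine_tuning": {"llm"},
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Return a module -> is_trainable mapping for the given training stage."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"unknown stage: {stage!r}")
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}


alignment_flags = trainable_flags("alignment")
fine_tuning_flags = trainable_flags("fine_tuning")
```

In a PyTorch training loop, flags like these would typically be applied by setting `requires_grad` on each module's parameters before building the optimizer.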