HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

Chen Li, Eric Peh, Basura Fernando

TL;DR

This work tackles the challenge of extending vision-language reasoning to 3D scenes by enabling explicit input-level alignment between 3D data and large vision-language models. It translates 3D scenes into structured text descriptions and renders multi-view images, then fuses these with a hierarchical visual representation to provide both local and global context for reasoning. The approach, fine-tuned with LoRA on a pretrained Qwen-VL backbone, achieves state-of-the-art results on situated (SQA3D) and general (ScanQA) 3D Q&A benchmarks, outperforming prior latent-space alignment methods. The findings underscore the value of explicit cross-modal grounding and multi-view aggregation for robust 3D scene understanding with VLMs, while acknowledging segmentation quality and complex spatial reasoning as areas for improvement.

Abstract

Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
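The hierarchical aggregation described in the abstract (patch-level features pooled into view-level and then scene-level representations) can be illustrated with a minimal sketch. Mean pooling, the function names, and the toy feature sizes are assumptions made for illustration; the abstract does not state which aggregation operator the authors use.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average a list of equal-length feature vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*vectors)]


def hierarchical_features(patch_features):
    """Build three levels of features from per-view patch features.

    patch_features: list over views, each a list of patch vectors.
    Returns (patch_level, view_level, scene_level).
    Mean pooling is an assumption, not the paper's stated operator.
    """
    # View level: one vector per rendered view, pooled over its patches.
    view_level = [mean_pool(patches) for patches in patch_features]
    # Scene level: one vector for the whole scene, pooled over the views.
    scene_level = mean_pool(view_level)
    return patch_features, view_level, scene_level


# Toy input: 5 views (top-down plus 4 directional), 2 patches each, 3-dim features.
views = [[[float(v), 0.0, 1.0], [float(v), 2.0, 1.0]] for v in range(5)]
patch_level, view_level, scene_level = hierarchical_features(views)
```

In this sketch all three levels would be kept, so that a downstream model can attend to local detail (patches) as well as global context (views and scene).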

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of (a) conventional embedding space alignment, which maps 3D point cloud features into the VLM via feature extraction and projection, and (b) our proposed input space alignment, which converts the 3D scene into textual descriptions and multi-view rendered images before feeding them directly into the VLM.
  • Figure 2: Overview of our proposed pipeline, which consists of two main stages: (1) generation of structured text descriptions from the 3D scene and (2) multi-view image rendering with hierarchical visual feature extraction. Both the textual and visual information are jointly fed into the model to generate the final answer.
  • Figure 3: An example of the rendered multi-view images, including a bird's-eye view and four top-down directions (forward, left, right, and backward) to ensure comprehensive spatial coverage of the 3D scene.
  • Figure 4: Prompt template for our model.
  • Figure 5: Qualitative examples from the SQA3D dataset. The red arrow indicates the agent's position and facing direction as referenced in the situational description.
  • ...and 1 more figure
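
The first pipeline stage named in Figure 2, generating structured text descriptions that reference the 3D coordinates of detected objects, might look roughly like the sketch below. The `DetectedObject` type, the `describe_scene` helper, the coordinate convention, and the sentence format are all hypothetical; the actual prompt template is the one shown in the paper's Figure 4, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """A detected object with a class label and a 3D center (hypothetical type)."""
    label: str
    center: tuple  # (x, y, z) in scene coordinates; units and axes are assumed


def describe_scene(objects):
    """Render detected objects as one text line each, including 3D coordinates.

    The line format is an illustrative assumption, not the paper's template.
    """
    lines = []
    for idx, obj in enumerate(objects):
        x, y, z = obj.center
        lines.append(f"object {idx}: {obj.label} at ({x:.2f}, {y:.2f}, {z:.2f})")
    return "\n".join(lines)


scene = [
    DetectedObject("chair", (1.0, 2.5, 0.45)),
    DetectedObject("table", (1.2, 3.0, 0.4)),
]
text = describe_scene(scene)
```

Such a description would be passed to the model as ordinary text alongside the rendered views, which is what makes the alignment explicit at the input level rather than in a learned embedding space.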