HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Chen Li, Eric Peh, Basura Fernando
TL;DR
This work tackles the challenge of extending vision-language reasoning to 3D scenes by enabling explicit input-level alignment between 3D data and large vision-language models. It translates 3D scenes into structured text descriptions and renders multi-view images, then fuses these with a hierarchical visual representation to provide both local and global context for reasoning. The approach, fine-tuned with LoRA on a pretrained Qwen-VL backbone, achieves state-of-the-art results on situated (SQA3D) and general (ScanQA) 3D Q&A benchmarks, outperforming prior latent-space alignment methods. The findings underscore the value of explicit cross-modal grounding and multi-view aggregation for robust 3D scene understanding with VLMs, while acknowledging segmentation quality and complex spatial reasoning as areas for improvement.
Abstract
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs in the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images comprise a top-down view and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
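To make the two input-level components concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that aggregation is plain mean pooling (the abstract does not specify the aggregation operator) and that object descriptions follow a simple "label at (x, y, z)" template. The names `mean_pool`, `build_hierarchy`, `describe_objects`, and `VIEWS` are hypothetical, and the feature values are toy numbers.

```python
from statistics import fmean

# The five rendered views named in the abstract.
VIEWS = ("top_down", "forward", "left", "right", "backward")


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors element-wise."""
    return [fmean(column) for column in zip(*vectors)]


def build_hierarchy(patch_features):
    """Aggregate patch-level features into view-level and scene-level features.

    patch_features maps a view name to its list of patch feature vectors.
    Mean pooling is an assumption made for this sketch; the actual model
    may use a learned aggregation instead.
    """
    view_level = {
        name: mean_pool(patches) for name, patches in patch_features.items()
    }
    scene_level = mean_pool(list(view_level.values()))
    return view_level, scene_level


def describe_objects(objects):
    """Render detected objects as text that references their 3D coordinates.

    objects is a list of (label, (x, y, z)) pairs. The sentence template
    is hypothetical.
    """
    return "\n".join(
        f"{label} at ({x:.2f}, {y:.2f}, {z:.2f})" for label, (x, y, z) in objects
    )


# Toy input: two 2-dimensional patch features per view.
patch_features = {
    name: [[float(i), float(i + 1)], [float(i + 2), float(i + 3)]]
    for i, name in enumerate(VIEWS)
}
view_level, scene_level = build_hierarchy(patch_features)
scene_text = describe_objects([("chair", (1.0, 2.5, 0.4)), ("table", (0.0, 1.2, 0.75))])
```

In this sketch, the patch-level features carry local detail, each view-level vector summarizes one rendered image, and the single scene-level vector summarizes all five views. A model consuming all three levels alongside `scene_text` would have access to both local and global context, which is the property the abstract emphasizes.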
