Direction-aware 3D Large Multimodal Models

Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

TL;DR

This work addresses the ill-posed directional reasoning problem in 3D large multimodal models (3D LMMs) caused by missing ego poses. It introduces PoseRecover, which automatically recovers ego poses from RGB-D data, and PoseAlign, which aligns point clouds or embeddings to the recovered ego frame, enabling persistent direction awareness across diverse architectures. Experiments on multiple benchmarks (ScanRefer, ScanQA, Scan2Cap, Multi3DRefer) and backbones (LL3DA, LL3DA-SONATA, Chat-Scene, 3D-LLAVA) show substantial gains in direction-sensitive tasks, including improvements of up to 30.0% in ScanRefer mIoU and notable gains in LLM-as-judge accuracy. The approach is lightweight, training-efficient, and broadly applicable, establishing a strong baseline for direction-aware 3D LMMs in embodied indoor scenarios.

Abstract

3D large multimodal models (3D LMMs) rely heavily on ego poses for directional question answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, which makes them inherently ill-posed for 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying ego poses, supplementing point cloud benchmarks with them, and transforming the corresponding point cloud data according to the identified poses. The paradigm rests on two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and a Z-buffer visibility check. The second is PoseAlign, which transforms the point cloud data to align with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D LMMs.
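
The alignment step that the abstract attributes to PoseAlign can be illustrated with a plain rigid-body change of frame. The sketch below is not the authors' implementation. It assumes the ego pose is given as a 4x4 camera-to-world matrix, and the function name `align_to_ego` is hypothetical. Whether the paper applies the full 6-DoF transform or only a yaw-and-translation alignment is not stated in this summary.

```python
import numpy as np


def align_to_ego(points_world: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Express world-frame points (N, 3) in the ego (camera) frame.

    Assumption: cam_to_world is a 4x4 rigid transform [R | t] that maps
    camera coordinates to world coordinates.
    """
    R = cam_to_world[:3, :3]  # rotation, camera -> world
    t = cam_to_world[:3, 3]   # camera position in the world frame
    # Inverse rigid transform p_ego = R^T (p_world - t), written in
    # row-vector form so it applies to all N points at once.
    return (points_world - t) @ R
```

After this transform the ego sits at the origin, so a directional phrase such as "to the left" refers to a fixed axis of the point cloud rather than to an arbitrary world axis.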

Paper Structure (60 sections, 7 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Ego pose is critical to spatial reasoning and understanding. (a) Direction-agnostic 3D LMMs struggle to reason about spatial directions because ego-pose information is absent. (b) Incorporating the ego pose resolves directional ambiguity, enabling consistent and robust spatial reasoning.
  • Figure 2: The offline data generation pipeline for PoseRecover. (a) Object annotations and camera poses are obtained from ScanNet-v2 (Dai et al., 2017). Camera poses and objects are downsampled for visibility. Zoom in for details. (b) PoseRecover exhaustively calculates the intersection rates between objects and camera frustums. (c) The visibility of each intersection is further validated with a z-buffer. (d) These intersection rates are saved and later sampled during training or inference to supply ego poses to the models.
  • Figure 3: Three viable designs for PoseAlign. We explore three mutually exclusive designs for incorporating ego poses into the vanilla model in (a): 1) PoseAlign-Transform, which shifts point clouds to the ego reference frame, in (b); 2) PoseAlign-Embed, which encodes ego poses into point cloud features, in (c); 3) PoseAlign-Prompt, which integrates ego poses into the text prompt, in (d). The projection layer and the LoRA (Hu et al., 2022) weights of the LLM are trained with instruction tuning.
  • Figure 4: Effect of pose clipping. The kernel density estimate (KDE; Turlach, 1993) of the maximum yaw difference among pose candidates rapidly concentrates around zero as the clip ratio increases on ScanQA. A higher clip ratio reduces data variety but improves pose stability.
  • Figure 5: Qualitative results on direction-critical questions for the 3D-LLAVA baseline (top row) and PoseAlign-Transform (bottom row). The X, Y, and Z axes of the world coordinate frame are colored red, green, and blue, respectively. The baseline paradigm uses the default world coordinates of ScanNet-v2, which are non-informative. The PoseAlign paradigm instead aligns the coordinate frame to the recovered ego pose, providing an anchor for robust spatial reasoning. Red text highlights wrong answers and green text highlights correct answers.
  • ...and 1 more figures
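
The frustum-intersection and z-buffer steps described for Figure 2 can be sketched for a single object and a single camera frame as follows. This is a minimal illustration under stated assumptions, not the PoseRecover pipeline itself: the function name `visible_fraction`, the pinhole intrinsics matrix `K`, the camera-to-world pose convention, the use of a per-frame depth map as the z-buffer, and the tolerance `tol` are all hypothetical choices made for the example.

```python
import numpy as np


def visible_fraction(points_world, cam_to_world, K, depth, tol=0.05):
    """Fraction of object points that fall inside the camera frustum
    and pass a z-buffer check against the depth map.

    Assumptions: cam_to_world is a 4x4 camera-to-world pose, K is a 3x3
    pinhole intrinsics matrix, depth is an (H, W) map in the same units
    as the points, and a depth of 0 marks a missing measurement.
    """
    if len(points_world) == 0:
        return 0.0
    h, w = depth.shape
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_cam = (points_world - t) @ R       # world -> camera frame
    z = pts_cam[:, 2]
    in_front = z > 1e-6                    # discard points behind the camera
    safe_z = np.where(in_front, z, 1.0)    # avoid division by zero
    # Pinhole projection to pixel coordinates.
    u = np.floor(K[0, 0] * pts_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.floor(K[1, 1] * pts_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    in_frustum = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Z-buffer test: a point is visible only if it is not behind the
    # surface recorded at its pixel (within a small tolerance).
    visible = np.zeros(len(points_world), dtype=bool)
    idx = np.nonzero(in_frustum)[0]
    d = depth[v[idx], u[idx]]
    visible[idx] = (d > 0) & (z[idx] <= d + tol)
    return float(visible.mean())
```

Scores of this kind, computed for every object and frame pair, could then be stored and sampled later to select an ego pose for each question, in the spirit of panel (d).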