Direction-aware 3D Large Multimodal Models
Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu
TL;DR
This work addresses the ill-posed directional reasoning problem in 3D large multimodal models (3D LMMs) caused by missing ego poses. It introduces PoseRecover, which automatically recovers ego poses from RGB-D data, and PoseAlign, which aligns point clouds or embeddings to the recovered ego frame, enabling persistent direction awareness across diverse architectures. Experiments on multiple benchmarks (ScanRefer, ScanQA, Scan2Cap, Multi3DRefer) and backbones (LL3DA, LL3DA-SONATA, Chat-Scene, 3D-LLAVA) show substantial gains on direction-sensitive tasks, including improvements of up to 30.0% in ScanRefer mIoU and notable gains in LLM-as-judge accuracy. The approach is lightweight, training-efficient, and broadly applicable, establishing a strong baseline for direction-aware 3D LMMs in embodied indoor scenarios.
Abstract
3D large multimodal models (3D LMMs) rely heavily on ego poses for directional question answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, which makes these queries inherently ill-posed for 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying ego poses, adding them to point cloud benchmarks, and transforming the corresponding point cloud data according to the identified ego poses. We realize this paradigm with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and a Z-buffer visibility check. The second is PoseAlign, which transforms the point cloud data to align with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D LMMs.
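To illustrate the idea behind aligning a point cloud to an ego pose, the following is a minimal sketch and not the PoseAlign implementation itself. It assumes the ego pose is given as a 4x4 rigid camera-to-world matrix (a convention the abstract does not specify) and re-expresses world-frame points in the camera frame. The function name `align_to_ego_frame` and the example pose are hypothetical.

```python
import numpy as np


def align_to_ego_frame(points_world, cam_to_world):
    """Express world-frame points in the ego (camera) frame.

    points_world: (N, 3) xyz coordinates in the scene/world frame.
    cam_to_world: (4, 4) rigid transform from camera to world coordinates
                  (assumed convention, not taken from the paper).
    """
    points_world = np.asarray(points_world, dtype=np.float64)
    cam_to_world = np.asarray(cam_to_world, dtype=np.float64)
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    # Inverse of a rigid transform: p_cam = R^T (p_world - t).
    # With points stored as rows, this is (p_world - t) @ R.
    return (points_world - translation) @ rotation


# Hypothetical ego pose: camera at (1, 2, 0), rotated 90 degrees about z,
# so the camera x-axis points along the world y-axis.
example_pose = np.array([
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 2.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
])

# A point one unit ahead of the camera along world y lies on the camera x-axis.
example_points = np.array([[1.0, 3.0, 0.0], [1.0, 2.0, 0.0]])
aligned = align_to_ego_frame(example_points, example_pose)
```

After this transformation, directional terms such as "left" or "behind" can be read off fixed axes of the ego frame, which is the intuition for why an aligned point cloud makes directional queries well-posed. Which axis corresponds to which direction depends on the camera convention in use.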
