Situational Awareness Matters in 3D Vision Language Reasoning

Yunze Man, Liang-Yan Gui, Yu-Xiong Wang

TL;DR

This work introduces SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. It tokenizes the 3D scene into a sparse voxel representation and applies a language-grounded situation estimator, followed by a situated question answering module.

Abstract

Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into a sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
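
To make the phrase "sparse voxel representation" concrete, here is a minimal sketch in plain Python. It is not the SIG3D implementation: the function name `voxelize`, the `voxel_size` value, and the choice of a per-voxel centroid as the token are all assumptions made for illustration. The only idea it shows is that points are bucketed into grid cells and only occupied cells are kept.

```python
from collections import defaultdict
import math

def voxelize(points, voxel_size):
    """Group 3D points into occupied voxels and return one token per voxel.

    Each token is (voxel_index, centroid). Only occupied voxels are stored,
    which is what makes the representation sparse. The centroid token is an
    illustrative stand-in for a learned voxel feature.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer grid cell that contains the point.
        idx = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[idx].append((x, y, z))

    tokens = []
    for idx in sorted(buckets):
        pts = buckets[idx]
        n = len(pts)
        centroid = tuple(sum(p[i] for p in pts) / n for i in range(3))
        tokens.append((idx, centroid))
    return tokens

# Hypothetical toy point cloud: four points falling into three voxels.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.4), (-0.2, 0.5, 0.0)]
tokens = voxelize(points, voxel_size=0.5)
for idx, centroid in tokens:
    print(idx, centroid)
```

In this toy input the first two points share a voxel, so four points yield three tokens; empty cells of the grid are never materialized.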

Paper Structure (20 sections, 1 equation, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Previous methods perform direct 3D vision language reasoning without modeling the situation of an embodied agent in the 3D environment. Our method, SIG3D, grounds the situational description in the 3D space, and then re-encodes the visual tokens from the agent's intended perspective before vision-language fusion, resulting in a more comprehensive and generalized 3D vision language (VL) representation and reasoning framework. Q, K, V stand for query, key, and value, respectively.
  • Figure 2: Situation estimation in existing methods [ma2022sqa3d] fails in most scenarios, indicating a missing registration between the situational descriptions and the 3D embeddings. Red: Ground truth (GT) vector. Blue: Estimated vector.
  • Figure 3: Results on variants of the representative SQA3D baseline method [ma2022sqa3d] demonstrate that situational understanding, despite being indispensable for perceiving the context of questions, makes a negligible contribution in existing methods. This motivates a situation-guided 3D encoding mechanism in our model.
  • Figure 4: Overview of our SIG3D model, which includes 3D scene and text encoding, anchor-based situation estimation, situation-guided visual re-encoding, and multi-modal decoder modules. We tokenize the 3D scene into voxels, treat each token as an anchor point, and query the text tokens to predict a token-level position likelihood and rotation matrix to locate the situational vector associated with the textual description. Then we update the scene tokens with situational position encoding (PE), and finally perform the 3D VL reasoning task with a large transformer decoder.
  • Figure 5: Qualitative results demonstrate significant improvement of SIG3D over prior methods. The first row shows results from SQA3D [ma2022sqa3d], and the second row shows results from our method. In the 3D scene, red: Ground truth (GT) vector, and blue: Estimated vector.
  • ...and 5 more figures
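
The Figure 4 caption describes treating each scene token as an anchor, predicting a per-token position likelihood and a rotation, and then updating the scene tokens relative to the estimated situation. The sketch below illustrates only the geometry of that idea in plain Python, under several assumptions that do not come from the paper: the `scores` list is a hypothetical stand-in for the output of a learned estimator, the rotation is reduced to a single yaw angle about the vertical axis, and the "re-encoding" is shown as a plain change of coordinates rather than the model's situational position encoding.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_situation(anchors, scores):
    """Pick the anchor with the highest position likelihood.

    `scores` are hypothetical per-anchor logits; in a real model they would
    come from a network that attends over the text tokens.
    """
    probs = softmax(scores)
    best = max(range(len(anchors)), key=lambda i: probs[i])
    return anchors[best], probs[best]

def to_agent_frame(point, agent_pos, yaw):
    """Express a world-frame point in the agent's frame.

    The agent frame is centered at `agent_pos` with its x-axis along the
    heading `yaw` (radians, about the z-axis). Yaw-only rotation is an
    assumption made to keep the example short.
    """
    dx = point[0] - agent_pos[0]
    dy = point[1] - agent_pos[1]
    dz = point[2] - agent_pos[2]
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the offset by -yaw so that "straight ahead" maps to +x.
    return (c * dx + s * dy, -s * dx + c * dy, dz)

# Hypothetical anchors (voxel centers) and likelihood logits.
anchors = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0)]
scores = [0.1, 2.0, 0.3]
agent_pos, confidence = select_situation(anchors, scores)

# The agent faces the world +y direction; an object one unit further along +y
# should therefore end up directly ahead of it in the agent frame.
local = to_agent_frame((2.0, 1.0, 0.0), agent_pos, yaw=math.pi / 2)
print(agent_pos, round(confidence, 3), local)
```

Expressing every token relative to the estimated position and heading is what lets a downstream question such as "what is on my left?" depend on the situation rather than on the world frame.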