Spatio-Temporal LLM: Reasoning about Environments and Actions

Haozhen Zheng; Beitong Tian; Mingyuan Wu; Zhenggang Tang; Klara Nahrstedt; Alex Schwing

TL;DR

Current multimodal LLMs struggle with prompts that require grounding both a full 3D environment and evolving egocentric actions. The paper introduces the REA dataset and two STLLM baselines (STLLM-3D and STLLM-Aligner) to fuse point-cloud context with video and text, and evaluates them with LLM judges. Results show substantial gains over existing MLLMs on REA, with STLLM-Aligner achieving the best performance and demonstrating cross-dataset generalization to SQA3D. This work provides a concrete step toward intrinsic spatio-temporal grounding for embodied AI and offers datasets and baselines to drive further research in joint 3D-spatial and temporal reasoning.

Abstract

Despite significant recent progress in Multimodal Large Language Models (MLLMs), current MLLMs are challenged by "spatio-temporal" prompts, i.e., prompts that refer to 1) the entirety of an environment, encoded in a point cloud, that the MLLM should consider; and simultaneously 2) actions that happened in part of the environment and are encoded in a short egocentric video clip. Yet such holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent MLLMs indeed struggle to correctly answer "spatio-temporal" prompts. Building on this dataset, we study two spatio-temporal LLM (STLLM) baselines: 1) STLLM-3D, which directly fuses point cloud, video, and text representations as inputs to the LLM; and 2) STLLM-Aligner, which aligns spatial context with video and text before LLM decoding. Both baselines aim to enhance spatial understanding of environments and temporal grounding of egocentric observations. On REA, the STLLM baselines outperform existing models, demonstrating the effectiveness of our designs. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
