Spatio-Temporal LLM: Reasoning about Environments and Actions

Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt, Alex Schwing

TL;DR

Current multimodal LLMs struggle with prompts that require grounding both a full 3D environment and evolving egocentric actions. The paper introduces the REA dataset and two STLLM baselines (STLLM-3D and STLLM-Aligner) to fuse point-cloud context with video and text, and evaluates them with LLM judges. Results show substantial gains over existing MLLMs on REA, with STLLM-Aligner achieving the best performance and demonstrating cross-dataset generalization to SQA3D. This work provides a concrete step toward intrinsic spatio-temporal grounding for embodied AI and offers datasets and baselines to drive further research in joint 3D-spatial and temporal reasoning.

Abstract

Despite significant recent progress of Multimodal Large Language Models (MLLMs), current MLLMs are challenged by "spatio-temporal" prompts, i.e., prompts that refer to 1) the entirety of an environment encoded in a point cloud that the MLLM should consider; and simultaneously also refer to 2) actions that happened in part of the environment and are encoded in a short ego-centric video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent MLLMs indeed struggle to correctly answer "spatio-temporal" prompts. Building on this dataset, we study two spatio-temporal LLM (STLLM) baselines: 1) STLLM-3D, which directly fuses point cloud, video, and text representations as inputs to the LLM; and 2) STLLM-Aligner, which aligns spatial context with video and text before LLM decoding. Both baselines aim to enhance spatial understanding of environments and temporal grounding of egocentric observations. On REA, the STLLM baselines outperform existing models, demonstrating the effectiveness of our designs. Code and data are available at https://zoezheng126.github.io/STLLM-website/.

Paper Structure

This paper contains 28 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Spatial and temporal reasoning is needed to answer prompts in "Reasoning about Environments and Actions" (REA). Ego-centric videos only show part of the point cloud environment.
  • Figure 2: Training data statistics.
  • Figure 3: Dataset generation pipeline. Note that in steps 2 and 3, camera poses (in green), sampled across the action interval, are used to compute the relative direction and distance between the person (moving along the arrow) and the object. To obtain per-frame camera poses for the query video, we first use VGGT (Wang et al., 2025) to re-compute the point cloud (step 5) and subsequently apply Reloc3r (Dong et al., 2025) (step 6).
  • Figure 4: Architectures of STLLM-3D and STLLM-Aligner.
  • Figure 5: Furniture Affordance Prediction example.
  • ...and 5 more figures
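
The Figure 3 caption mentions that camera poses are used to compute the relative direction and distance between the person and an object. The sketch below illustrates one way such a computation could look for a single pose; it is not the authors' implementation. It assumes a camera-to-world pose given as a 3x3 rotation `R` and a translation `t`, an OpenCV-style camera frame (x right, y down, z forward), and 45-degree sectors for the direction labels. The function names and thresholds are hypothetical.

```python
import math

def world_to_camera(R, t, p_world):
    """Express a world-space point in the camera frame.

    Assumes (R, t) is a camera-to-world pose: R's columns are the camera
    axes in world coordinates and t is the camera position in the world.
    """
    offset = [p_world[i] - t[i] for i in range(3)]
    # Multiplying by the transpose of R maps the world offset onto camera axes.
    return [sum(R[r][c] * offset[r] for r in range(3)) for c in range(3)]

def relative_direction_and_distance(R, t, p_world):
    """Return a coarse direction label and the Euclidean distance to the object.

    Assumed camera convention: x right, y down, z forward. The 45-degree
    sector boundaries are an illustrative choice, not taken from the paper.
    """
    x, y, z = world_to_camera(R, t, p_world)
    distance = math.sqrt(x * x + y * y + z * z)
    # Horizontal bearing: 0 degrees is straight ahead, positive is to the right.
    angle = math.degrees(math.atan2(x, z))
    if -45.0 <= angle <= 45.0:
        direction = "front"
    elif 45.0 < angle < 135.0:
        direction = "right"
    elif -135.0 < angle < -45.0:
        direction = "left"
    else:
        direction = "behind"
    return direction, distance

# Identity pose: the camera sits at the origin and looks along world +z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera whose forward axis points along world +x (y still points down).
FACING_X = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

if __name__ == "__main__":
    print(relative_direction_and_distance(IDENTITY, [0.0, 0.0, 0.0], [3.0, 0.0, 0.0]))
    print(relative_direction_and_distance(FACING_X, [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]))
```

Since the caption describes poses sampled across the action interval, a fuller pipeline would presumably call such a function once per sampled pose and then aggregate the results; how that aggregation is done is not stated here.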