OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang

TL;DR

OST-Bench introduces an online, embodied benchmark for spatio-temporal reasoning, emphasizing incremental perception and memory integration. It evaluates a range of MLLMs on 1.4k real-world scenes with 10k QA pairs, posed as a multi-round dialogue over streamed observations, and reveals substantial gaps to human performance, especially in dynamic state and spatial relation tasks. The analysis identifies a spatio-temporal reasoning shortcut and demonstrates that both complex reasoning and long-term memory retrieval are key bottlenecks, with fine-tuning offering only modest gains. The work provides extensive methodological details, error analyses, and cross-view assessments, offering a robust platform to drive future advances in online embodied understanding, while acknowledging current limitations such as static environments and lack of interaction.

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly degrade model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
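
The online setting described in the abstract can be illustrated with a minimal sketch: observations arrive incrementally, the model answers each question from everything seen so far, and accuracy is tracked per dialogue turn so that any decline over the exploration horizon becomes visible. This is not the OST-Bench evaluation code. The `Turn` schema, the `evaluate_online` function, the exact-match scoring, and the toy model are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Turn:
    """One dialogue round (hypothetical schema): new observations, then a question."""
    new_frames: List[str]
    question: str
    answer: str


def evaluate_online(
    episodes: Sequence[Sequence[Turn]],
    model: Callable[[List[str], str], str],
) -> Dict[int, float]:
    """Return exact-match accuracy per turn index, aggregated over episodes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for episode in episodes:
        memory: List[str] = []  # all observations seen so far in this episode
        for t, turn in enumerate(episode):
            memory.extend(turn.new_frames)  # observations arrive incrementally
            prediction = model(list(memory), turn.question)
            match = prediction.strip().lower() == turn.answer.strip().lower()
            correct[t] += int(match)
            total[t] += 1
    return {t: correct[t] / total[t] for t in sorted(total)}


def latest_frame_only_model(frames: List[str], question: str) -> str:
    """Toy stand-in for an MLLM that ignores history and reports the newest frame."""
    return frames[-1]


# Frames are plain strings here; a real pipeline would pass images.
episodes = [[
    Turn(["chair"], "What did you see first?", "chair"),
    Turn(["table"], "What did you see first?", "chair"),
]]
accuracy_by_turn = evaluate_online(episodes, latest_frame_only_model)
print(accuracy_by_turn)  # {0: 1.0, 1: 0.0}
```

The toy model answers the first turn correctly but fails the second because it does not consult earlier observations, which is a simplified analogue of the memory-retrieval failure the abstract describes.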

Paper Structure

This paper contains 33 sections, 17 figures, 5 tables.

Figures (17)

  • Figure 1: OST-Bench is designed from the perspective of an embodied agent dynamically exploring static indoor environments, with a focus on online and spatio-temporal understanding. Compared to the conventional offline setting (top right), which answers questions based on a fixed-length video of the scene, the bottom section illustrates our online setting: for the same question, the agent’s answers evolve as it explores the scene, changing from blue (t1) to red (t2) to green (t3), reflecting its continuously updated understanding.
  • Figure 2: OST-Bench categorizes questions into three main categories. Each main category includes several subtypes; in total, the benchmark comprises 15 fine-grained question subtypes.
  • Figure 3: Model performance over exploration time. The right side shows a general decline in answer accuracy for all models; the left side illustrates the accuracy trends across three main categories for InternVL-2.5-38B and GPT-4.1.
  • Figure 4: Distribution of three error types across the three task categories in OST-Bench.
  • Figure 5: An example of the Spatio-temporal Reasoning Shortcut. Green text indicates correct reasoning by the model, while red text highlights incorrect reasoning.
  • ...and 12 more figures