StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang

TL;DR

StreamEQA introduces the first benchmark for streaming video question answering in embodied scenarios, enabling evaluation of how Video-LLMs reason over continuous, context-rich egocentric scenes. It organizes 42 tasks along Embodied (Perception, Interaction, Planning) and Streaming (Backward, Real-time, Forward) axes, built on 156 long videos with ~21K time-stamped QA pairs. Evaluations of 13 diverse Video-LLMs reveal substantial gaps in interaction and planning under streaming conditions, with noticeable penalties when models must reason over temporally evolving embodied content. The dataset construction leverages HD-EPIC annotations and GPT-5-driven meta-information extraction to produce grounded, time-stamped QA with refined distractors, providing a reproducible foundation for advancing temporally grounded, embodied streaming video understanding.

Abstract

As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
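
To make the two-axis organization concrete, the following is a minimal sketch of how a time-stamped QA item could be represented and how accuracy could be reported per (embodied level, streaming mode) cell. The record layout, the field names (`video_id`, `timestamp_s`, `embodied`, `streaming`, `answer`), and the helper `accuracy_by_cell` are illustrative assumptions; they are not the benchmark's released format or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The two axes named in the abstract; the string labels are assumptions.
EMBODIED_LEVELS = ("perception", "interaction", "planning")
STREAMING_MODES = ("backward", "real-time", "forward")


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one time-stamped multiple-choice question."""
    video_id: str
    timestamp_s: float  # moment in the stream at which the question is asked
    embodied: str       # one of EMBODIED_LEVELS
    streaming: str      # one of STREAMING_MODES
    answer: str         # ground-truth option label, e.g. "B"


def accuracy_by_cell(items, predictions):
    """Return accuracy for each (embodied, streaming) cell that has items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        cell = (item.embodied, item.streaming)
        total[cell] += 1
        correct[cell] += int(pred == item.answer)
    return {cell: correct[cell] / total[cell] for cell in total}


# Toy data, invented purely for illustration.
items = [
    QAItem("v001", 12.5, "perception", "real-time", "A"),
    QAItem("v001", 40.0, "perception", "real-time", "C"),
    QAItem("v002", 95.0, "planning", "forward", "B"),
]
predictions = ["A", "B", "B"]
scores = accuracy_by_cell(items, predictions)
```

Reporting per cell rather than a single aggregate keeps the two dimensions separable, so a weakness in, say, forward planning is not hidden by strong real-time perception.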

Paper Structure

This paper contains 31 sections, 18 figures, 7 tables.

Figures (18)

  • Figure 1: A comparative overview of our StreamEQA and existing benchmarks. StreamEQA integrates both embodied and streaming requirements, enabling a more comprehensive evaluation of Video-LLMs in their progression toward real-world embodied applications.
  • Figure 2: Data construction pipeline of StreamEQA.
  • Figure 3: Overview of the StreamEQA task taxonomy and statistics. Top-left: The overall task taxonomy of the three main embodied levels. Top-right: The distributions of questions across embodied levels and temporal dimension. Bottom: A selection of representative QA examples for each major capability.
  • Figure 4: The performance of Qwen3VL on the 42 Online-Embodied tasks, grouped by temporal dimension and sorted by accuracy within each embodied level.
  • Figure 5: Online and Online-Embodied performance comparison. The results highlight the gap caused by embodied scenarios.
  • ...and 13 more figures