EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR

This paper introduces a new task, Egocentric Scene Prediction with LOng-horizon REasoning, together with EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. The benchmark serves as a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.

Abstract

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
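
To make the task format and the stepwise-reasoning idea concrete, the following is a minimal Python sketch. It is not the benchmark's actual schema or the authors' inference code: the class names, field names, `chunk_size` parameter, and the `predict_step` callback (a stand-in for a call to an MLLM) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class SceneAnnotation:
    """Structured final-scene annotation (hypothetical field names)."""
    objects: List[str] = field(default_factory=list)
    # (object, attribute) pairs, e.g. ("cup", "empty")
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    # (subject, relation, object) triples, e.g. ("cup", "on", "table")
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


@dataclass
class Instance:
    """One task instance: initial image, atomic actions, annotated final scene."""
    initial_image_path: str
    actions: List[str]
    final_scene: SceneAnnotation


def chunk_actions(actions: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Split a long action sequence into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(actions), chunk_size):
        yield actions[start:start + chunk_size]


def stepwise_predict(
    instance: Instance,
    predict_step: Callable[[str, List[str]], str],
    chunk_size: int = 2,
) -> str:
    """Carry a textual scene state through the action chunks, one call per chunk.

    predict_step(state, actions) is an assumed placeholder for a model call
    that returns the updated scene description.
    """
    state = f"initial scene from {instance.initial_image_path}"
    for chunk in chunk_actions(instance.actions, chunk_size):
        state = predict_step(state, chunk)
    return state
```

In this sketch, a sequence of n actions costs ceil(n / chunk_size) calls to `predict_step` instead of a single call, which illustrates the kind of computational overhead the abstract attributes to decomposing long action sequences.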

Paper Structure

This paper contains 29 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of EXPLORE-Bench. EXPLORE-Bench evaluates MLLMs on a new task: egocentric scene prediction with long-horizon reasoning. We annotate the final scene at the object, attribute, and relation levels to enable fine-grained scene-level evaluation. Note that the prompt is abbreviated for brevity in this figure.
  • Figure 2: Data analysis of EXPLORE-Bench reflects the rich diversity of scenarios, objects, attributes, relations, and atomic actions.
  • Figure 3: Illustration of our scene annotation pipeline. Rather than having the MLLM generate annotations directly from the images, we adopt a multi-step pipeline that ensures object coverage and the accuracy of attributes and relations, greatly reducing the manual annotation effort required.
  • Figure 4: Unified score $\bm{S_{uni}}$ of Qwen3-VL-8B-Instruct across subsets under different inference strategies. Short, Medium, and Long denote the subsets with short, medium, and long atomic-action sequences. Full denotes the full dataset.
  • Figure 5: Average single-instance final-scene description length of Qwen3-VL-8B-Instruct under different inference strategies.
  • ...and 8 more figures