EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang
TL;DR
EgoExoBench introduces the first benchmark for cross-view egocentric-exocentric video understanding in multimodal LLMs, compiling 7,330 multiple-choice QA pairs across 11 subtasks spanning semantic alignment, viewpoint association, and temporal reasoning. The dataset is built from six paired ego-exo sources and uses a QA construction and filtering pipeline designed to ensure that questions are visually grounded. Zero-shot evaluations of 13 MLLMs show strong single-view performance but persistent difficulty with cross-view reasoning, with humans significantly outperforming the models. Chain-of-thought prompting yields limited benefits and cross-perspective guidance yields mixed gains, underscoring the need for architectures and training that jointly integrate multi-view visual and linguistic information for embodied intelligence.
Abstract
Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.
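Because the benchmark is scored as multiple-choice question answering grouped into three core challenges, a per-challenge accuracy breakdown is the natural way to read its results. The following is a minimal sketch of such scoring, not the authors' evaluation code: the record keys `category`, `answer`, and `prediction`, the function name `accuracy_by_group`, and the toy records are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (per-category accuracy, overall accuracy) for multiple-choice QA.

    Each record is a dict with hypothetical keys:
      'category'   - name of the challenge the question belongs to
      'answer'     - ground-truth option letter, e.g. "A"
      'prediction' - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record["category"]
        total[group] += 1
        # Compare option letters ignoring case and surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[group] += 1

    per_group = {group: correct[group] / total[group] for group in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return per_group, overall


# Toy records, invented for illustration only (not benchmark data).
toy_records = [
    {"category": "semantic alignment", "answer": "A", "prediction": "a"},
    {"category": "semantic alignment", "answer": "B", "prediction": "C"},
    {"category": "viewpoint association", "answer": "D", "prediction": "D"},
    {"category": "temporal reasoning", "answer": "C", "prediction": " C "},
]

per_group, overall = accuracy_by_group(toy_records)
print(per_group)
print(overall)
```

Reporting accuracy per challenge as well as overall keeps a weakness in one area, such as viewpoint association, from being hidden by stronger results elsewhere.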
