EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao
TL;DR
EgoExo-Con introduces a synchronized ego-exo video benchmark to probe cross-view temporal understanding in Video-LLMs across Temporal Verification and Temporal Grounding tasks. It reveals that current models struggle to maintain view-invariant reasoning, with cross-view consistency lagging behind single-view performance and naive multi-view fine-tuning offering limited gains. To address this, the authors propose View-GRPO, a reinforcement learning framework that guides viewpoint-specific temporal reasoning while aligning final conclusions, yielding stronger cross-view consistency than standard fine-tuning. The work also provides View30K, a dataset of reasoning instances for cross-view learning, and demonstrates that explicit reasoning rewards bolster cross-view robustness. Overall, EgoExo-Con and View-GRPO push toward true view-invariant video comprehension with practical implications for multi-perspective video understanding systems.
Abstract
Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined natural-language queries. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; (2) when naively finetuned with synchronized videos of both viewpoints, models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method outperforms naive SFT and GRPO, especially in cross-view consistency. All resources will be made publicly available.
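To make the distinction between correctness and cross-view consistency concrete, the following is a minimal sketch for the Temporal Grounding task. It is not the benchmark's official scoring code. It assumes that a prediction counts as correct when its temporal IoU with the ground-truth span reaches a threshold, and that a query counts as consistent only when the model is correct on both the egocentric and the exocentric video. The function names, the shared ground-truth span per synchronized pair, and the 0.5 threshold are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(
    ego_preds: List[Span],
    exo_preds: List[Span],
    golds: List[Span],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, float]:
    """Per-view accuracy plus a joint (cross-view) consistency score.

    Assumes the i-th ego and exo videos are synchronized, so both are
    scored against the same ground-truth span golds[i].
    """
    n = len(golds)
    ego_ok = [temporal_iou(p, g) >= iou_threshold for p, g in zip(ego_preds, golds)]
    exo_ok = [temporal_iou(p, g) >= iou_threshold for p, g in zip(exo_preds, golds)]
    both_ok = [a and b for a, b in zip(ego_ok, exo_ok)]
    return {
        "ego_acc": sum(ego_ok) / n,
        "exo_acc": sum(exo_ok) / n,
        "consistency": sum(both_ok) / n,
    }


if __name__ == "__main__":
    golds = [(0.0, 10.0), (5.0, 15.0)]
    ego = [(0.0, 10.0), (5.0, 15.0)]    # correct on both queries
    exo = [(0.0, 10.0), (30.0, 40.0)]   # misses the second query
    print(grounding_scores(ego, exo, golds))
```

Under this definition the consistency score can never exceed the lower of the two single-view accuracies, which is one way to see why cross-view results fall below single-view results whenever a model's errors differ between viewpoints.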
