EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao

TL;DR

EgoExo-Con introduces a synchronized ego-exo video benchmark to probe cross-view temporal understanding in Video-LLMs across Temporal Verification and Temporal Grounding tasks. It reveals that current models struggle to maintain view-invariant reasoning, with cross-view consistency lagging behind single-view performance and naive multi-view fine-tuning offering limited gains. To address this, the authors propose View-GRPO, a reinforcement learning framework that guides viewpoint-specific temporal reasoning while aligning final conclusions, yielding stronger cross-view consistency than standard fine-tuning. The work also provides View30K, a dataset of reasoning instances for cross-view learning, and demonstrates that explicit reasoning rewards bolster cross-view robustness. Overall, EgoExo-Con and View-GRPO push toward true view-invariant video comprehension with practical implications for multi-perspective video understanding systems.

Abstract

Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) Models often fail to maintain consistency, with results far worse than their single-view performance. (2) When naively fine-tuned with synchronized videos of both viewpoints, models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
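To make the idea of scoring consistency alongside correctness concrete, the sketch below shows one plausible way to compute a cross-view score. It is an illustration only, not the paper's metric definition: the rule that a synchronized pair counts as consistent only when the answer is correct in both views, the IoU threshold of 0.5, and the names `temporal_iou`, `grounding_hits`, and `cross_view_consistency` are all assumptions introduced here.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a video moment


def temporal_iou(pred: Span, gt: Span) -> float:
    """Temporal IoU between a predicted and a ground-truth moment."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # The hull equals the true union whenever the spans overlap;
    # when they do not, inter is 0 and the result is 0 either way.
    hull = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / hull if hull > 0 else 0.0


def grounding_hits(preds: List[Span], gts: List[Span], thr: float = 0.5) -> List[bool]:
    """Mark each prediction as correct if its IoU reaches thr (assumed threshold)."""
    return [temporal_iou(p, g) >= thr for p, g in zip(preds, gts)]


def cross_view_consistency(ego_correct: List[bool], exo_correct: List[bool]) -> float:
    """Fraction of synchronized pairs answered correctly in BOTH views (assumed rule)."""
    if len(ego_correct) != len(exo_correct):
        raise ValueError("ego and exo results must be aligned pair by pair")
    if not ego_correct:
        return 0.0
    both = sum(e and x for e, x in zip(ego_correct, exo_correct))
    return both / len(ego_correct)


# Hypothetical predictions for three synchronized ego-exo pairs.
gts = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
ego_preds = [(2.0, 8.0), (10.0, 18.0), (6.0, 9.0)]
exo_preds = [(3.0, 8.0), (30.0, 40.0), (7.0, 9.0)]

ego_ok = grounding_hits(ego_preds, gts)
exo_ok = grounding_hits(exo_preds, gts)
consistency = cross_view_consistency(ego_ok, exo_ok)
```

Under this assumed rule, the cross-view score can never exceed either single-view accuracy, which matches the abstract's observation that consistency lags behind single-view performance.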

Paper Structure

This paper contains 30 sections, 5 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Examples of queries and corresponding video moments from existing datasets. (a) and (b) highlight fundamental limitations, with the egocentric view (top) in (a) being insufficient due to differing focuses, and the exocentric view (bottom) in (b) being ambiguous due to occlusion and distance. Although the query in (c) is identifiable from both viewpoints, we enrich it with details.
  • Figure 2: Statistics of EgoExo-Con. The numbers below (a) show the video and moment counts per subset, and those in (b) and (c) show their average lengths, respectively. The statistics suggest that EgoExo-Con is highly diverse in data sources, video lengths, and moment lengths.
  • Figure 3: Examples of test data and the corresponding model responses. We create refined and misaligned queries from each original query, use them for temporal verification (V) and grounding (G), and assess cross-view answer consistency.
  • Figure 4: Heatmaps of the performance gap. All values are reported in percentage points. Red and blue indicate higher performance on the ego and exo perspectives, respectively; i.e., a blue cell indicates that the corresponding model performs better on exocentric videos than on egocentric ones.
  • Figure 5: Overview of our approach. (a) In supervised fine-tuning, the model is trained to directly predict the same query answers (e.g., video moments) for synchronized video pairs. (b) View-GRPO trains a model to provide viewpoint-specific reasoning chains, which are generated by GPT-5 (top).
  • ...and 9 more figures