
Intention-driven Ego-to-Exo Video Generation

Hongchen Luo, Kai Zhu, Wei Zhai, Yang Cao

TL;DR

We address ego-to-exo video generation under drastic viewpoint changes by introducing IDE, which uses action intention (human movement together with an action description) as a view-invariant bridge. The method combines a cross-view feature perception module, a trajectory transformation module, and an action description unit within a diffusion-based latent flow framework to jointly generate exocentric motion and interaction content. Experiments on the LEMMA dataset show that IDE outperforms state-of-the-art baselines on both perceptual and temporal metrics, supporting its effectiveness for consistent cross-view video synthesis. The approach has potential benefits for AR/VR, embodied AI, and human-computer interaction; its acknowledged limitations include scenarios with minimal head motion, and broader ethical considerations remain.

Abstract

Ego-to-exo video generation refers to generating the exocentric video that corresponds to a given egocentric video, with valuable applications in AR/VR and embodied AI. Advances in diffusion models have driven notable progress in video generation. However, existing methods rely on the assumption of spatiotemporal consistency between adjacent frames, which does not hold in ego-to-exo scenarios because of the drastic change in view. To this end, this paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action intention, consisting of human movement and an action description, as a view-independent representation to guide video generation, preserving the consistency of content and motion. Specifically, the egocentric head trajectory is first estimated through multi-view stereo matching. Then, a cross-view feature perception module is introduced to establish correspondences between the exocentric and egocentric views, guiding the trajectory transformation module to infer full-body human movement from the head trajectory. Meanwhile, we present an action description unit that maps the action semantics into a feature space consistent with the exocentric image. Finally, the inferred human movement and the high-level action description jointly guide the generation of exocentric motion and interaction content (i.e., the corresponding optical flow and occlusion maps) in the backward process of the diffusion model, and these are ultimately warped into the corresponding exocentric video. We conduct extensive experiments on the relevant dataset with diverse exo-ego video pairs, and IDE outperforms state-of-the-art models in both subjective and objective assessments, demonstrating its efficacy in ego-to-exo video generation.

Paper Structure

This paper contains 15 sections, 14 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Ego-to-exo video generation. Given an egocentric video and an initial exocentric frame, generate an exocentric video of the corresponding scene.
  • Figure 2: Motivation. (a) The human action intention consists of the human movement and the action description. The human movement can be obtained indirectly from the head trajectory. (b) Objects that co-occur in the exocentric and egocentric views make it feasible to connect the two perspectives and align their content.
  • Figure 3: The ego-to-exo video generation pipeline. (a) Human movement is inferred from the head trajectory and the relationship between the two views, while the text encoder maps the action description to the feature space consistent with the exocentric image. These two components serve as conditional inputs to the backward process of the diffusion model, guiding the generation of the corresponding optical flow and occlusion maps. (b) Objects shared by the two viewpoints are mined to establish content alignment between the views.
  • Figure 4: Intention-Driven Ego-to-exo video generation framework (IDE). The cross-view feature perception module (CFPM) uses class tokens from the different viewpoints to mine objects common to the exocentric and egocentric video frames, establishing connections between regions of the two views. The trajectory transformation module first uses the head motion to adjust the dynamic distribution of the egocentric features over time. It then leverages the ego-to-exo connections established by the CFPM to transfer the motion information to the exocentric features, while the action description provides more accurate interaction cues.
  • Figure 5: Exocentric video generation results of different methods in the Seen setting. The yellow box marks the first frame of the egocentric video and the red box marks the first frame of the real exocentric video.
  • ...and 5 more figures
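
The abstract and the Figure 3 caption say that the generated optical flow and occlusion maps are ultimately warped into the exocentric video, starting from the initial exocentric frame shown in Figure 1. The following is a minimal sketch of that kind of flow-based warping, not the authors' implementation. Several things here are assumptions made for illustration: the function names `warp_with_flow` and `warp_sequence`, the backward-flow convention, nearest-neighbour sampling, warping in pixel space (IDE is described as a latent flow framework, so the real warp may act on latent features), and the convention that an occlusion value of 1 keeps a pixel and 0 suppresses it.

```python
import numpy as np


def warp_with_flow(frame, flow, occlusion):
    """Backward-warp `frame` with a dense flow field and mask occluded pixels.

    frame:     (H, W, C) float array, e.g. the initial exocentric frame.
    flow:      (H, W, 2) float array; flow[y, x] = (dx, dy) is the offset from
               output pixel (x, y) to the source location that is sampled.
    occlusion: (H, W) float array in [0, 1]; 1 keeps the warped pixel, 0 marks
               a pixel the warp cannot explain (assumed convention).
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour sampling clamped to the image border. This is a
    # simplification; bilinear sampling is the more common choice.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    warped = frame[src_y, src_x]
    # Occluded pixels are zeroed here; a generator would normally fill them in.
    return warped * occlusion[..., None]


def warp_sequence(first_frame, flows, occlusions):
    """Warp the same initial frame once per (flow, occlusion) pair."""
    return np.stack([warp_with_flow(first_frame, f, o)
                     for f, o in zip(flows, occlusions)])
```

Warping every output frame from the same initial frame, rather than chaining frame-to-frame warps, is one way to avoid depending on consistency between adjacent frames, though whether IDE does exactly this is not stated in the text above.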