EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

TL;DR

This work tackles cross-view video prediction by animating ego-centric frames from synchronized exo-centric video, a starting ego frame, and a text instruction. It introduces EgoExo-Gen, a two-stage pipeline that first predicts hand-object interaction masks in ego-view using cross-view reasoning, then generates future ego-centric frames via an HOI-aware latent diffusion process guided by the predicted masks and the initial frame. An automated HOI mask annotation pipeline enables scalable training across ego- and exo-centric videos. Empirical results on Ego-Exo4D and H2O show state-of-the-art performance with improved realism of hands and interactive objects and strong zero-shot transfer, underscoring the method’s potential for AR and embodied AI tasks.

Abstract

Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen, which explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
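
The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustration of the interface, not the authors' implementation: the function names, array shapes, and the `num_future` parameter are all assumptions, and both stages are replaced by trivial stand-ins (an empty mask and a repeated first frame) so that the example runs without any trained model.

```python
import numpy as np


def predict_ego_hoi_masks(exo_video, first_ego_frame, num_future):
    """Stage 1 stand-in (hypothetical): one binary HOI mask per future ego frame.

    A real model would reason over the exo-centric video and the first ego
    frame; this placeholder ignores the exo video and returns empty masks.
    """
    height, width = first_ego_frame.shape[:2]
    return np.zeros((num_future, height, width), dtype=bool)


def generate_ego_frames(first_ego_frame, instruction, hoi_masks):
    """Stage 2 stand-in (hypothetical): future ego frames conditioned on masks.

    A real model would be a video diffusion model guided by the masks and the
    text instruction; this placeholder simply repeats the first frame.
    """
    return np.repeat(first_ego_frame[None], len(hoi_masks), axis=0)


def egoexo_predict(exo_video, first_ego_frame, instruction, num_future=8):
    """Chain the two stages: masks first, then mask-guided frame generation."""
    masks = predict_ego_hoi_masks(exo_video, first_ego_frame, num_future)
    frames = generate_ego_frames(first_ego_frame, instruction, masks)
    return masks, frames
```

The point of the sketch is the ordering of inputs and outputs: the exo-centric video is consumed only by the mask predictor, while the generator sees the first ego frame, the text instruction, and the predicted masks.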

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The cross-view video prediction task aims to predict future RGB frames of the ego-centric video, given the first ego-centric frame, a text instruction, and a synchronised exo-centric video.
  • Figure 2: An overview of EgoExo-Gen. Given an exo-centric video, a text instruction, and the first frame of an ego-centric video, (1) a cross-view mask prediction model first anticipates the hand-object masks of the unobserved future frames; (2) an HOI-aware video diffusion model then predicts future frames of an ego-centric video by incorporating the predicted hand-object masks.
  • Figure 3: Ego-Exo mask annotation pipeline. We first perform frame-wise annotation with hand-object detector/segmentor, and prompt SAM-2 to track HOI masks in the video.
  • Figure 4: Qualitative comparisons. EgoExo-Gen (w/o future) refers to our default model using the predicted HOI masks as condition. EgoExo-Gen (w/ future) uses the HOI masks extracted from visible future frames (see the training subsection), serving as an oracle. The last row shows a failure case with complex hand movement.
  • Figure 5: Visualisation of the exo-centric hand-object masks and predicted ego-centric masks at the visible 1$^{st}$ frame and invisible 5$^{th}$ and 10$^{th}$ frames.
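
The annotation pipeline in the Figure 3 caption (frame-wise detection, then prompted tracking) might be organised along these lines. This is a hedged sketch only: `annotate_hoi_masks`, the `detect` and `track` callables, and the choice of the first frame with a non-empty detection as the tracking prompt are all assumptions made for illustration. No real detector or SAM-2 API is called; the caller supplies both components.

```python
from typing import Callable, List, Optional

import numpy as np

# Hypothetical interfaces: a detector maps one frame to a binary mask (or
# None), and a tracker propagates a seed mask from one frame to all frames.
Detector = Callable[[np.ndarray], Optional[np.ndarray]]
Tracker = Callable[[List[np.ndarray], int, np.ndarray], List[np.ndarray]]


def annotate_hoi_masks(
    frames: List[np.ndarray], detect: Detector, track: Tracker
) -> List[np.ndarray]:
    """Return one pseudo HOI mask per frame.

    Assumption: the first frame with a non-empty detection seeds the tracker.
    If no frame yields a detection, every mask is empty.
    """
    for index, frame in enumerate(frames):
        seed = detect(frame)
        if seed is not None and seed.any():
            return track(frames, index, seed)
    return [np.zeros(frame.shape[:2], dtype=bool) for frame in frames]
```

Passing the detector and tracker in as callables keeps the sketch independent of any particular model, so the same control flow could be applied to both ego- and exo-centric videos.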