Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman

TL;DR

Exo2Ego is a generative framework that decouples exocentric-to-egocentric translation into two stages: a high-level structure transformation that explicitly encourages cross-view correspondence between the exocentric and egocentric views, and a diffusion-based pixel-level hallucination that incorporates a hand layout prior to improve the fidelity of the generated egocentric view.

Abstract

We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly encourages cross-view correspondence between exocentric and egocentric views, and a diffusion-based pixel-level hallucination, which incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To pave the way for future advancements in this field, we curate a comprehensive exo-to-ego cross-view translation benchmark. It consists of a diverse collection of synchronized ego-exo tabletop activity video pairs sourced from three public datasets: H2O, Aria Pilot, and Assembly101. The experimental results validate that Exo2Ego delivers photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of both synthesis quality and generalization ability to new actions.
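
The two-stage decoupling described in the abstract can be illustrated with a toy sketch of the data flow only. Everything in it is an assumption made for illustration: the function names `predict_ego_layout`, `hallucinate_pixels`, and `exo_to_ego` are hypothetical, the thresholding step merely stands in for the structure transformation, and the blending loop merely stands in for the conditional diffusion model. It is not the authors' implementation.

```python
import random
from typing import List

Frame = List[List[float]]  # one grayscale frame, values in [0, 1]


def predict_ego_layout(exo_frame: Frame, threshold: float = 0.5) -> Frame:
    """Stage 1 stand-in (hypothetical): produce a coarse binary layout map.

    The paper describes a learned structure transformation here; this
    placeholder only thresholds the input so the flow runs end to end.
    """
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in exo_frame]


def hallucinate_pixels(layout: Frame, steps: int = 10, seed: int = 0) -> Frame:
    """Stage 2 stand-in (hypothetical): refine noise, conditioned on the layout.

    The paper describes a conditional diffusion model; this placeholder just
    moves a random image part of the way toward the layout at every step.
    """
    rng = random.Random(seed)
    frame = [[rng.random() for _ in row] for row in layout]
    for step in range(steps):
        blend = 1.0 / (steps - step)  # the last step lands on the condition
        frame = [
            [x + blend * (c - x) for x, c in zip(frame_row, layout_row)]
            for frame_row, layout_row in zip(frame, layout)
        ]
    return frame


def exo_to_ego(exo_video: List[Frame]) -> List[Frame]:
    """Run both stages frame by frame: layout first, then pixel refinement."""
    return [hallucinate_pixels(predict_ego_layout(f)) for f in exo_video]
```

The point of the sketch is the interface: the second stage never sees the raw exocentric pixels directly in this toy version, only the layout produced by the first stage, which is one simple way to picture how an explicit layout prior can steer pixel generation.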

Paper Structure

This paper contains 34 sections, 4 equations, 14 figures, and 3 tables.

Figures (14)

  • Figure 1: Cross-view translation from exocentric to egocentric video. Given an exocentric video sequence (top), the goal is to generate the corresponding egocentric perspective (bottom).
  • Figure 2: Our Exo2Ego framework comprises two modules: (a) High-level Structure Transformation, which predicts the ego layout, capturing hand position and interactions using a transformer-based encoder-decoder architecture. (b) Diffusion-based Pixel Hallucination, which enhances pixel-level details on top of the ego layout using a conditional diffusion model.
  • Figure 3: Qualitative examples when generalizing to new actions on all datasets. More examples are in the Appendix.
  • Figure 4: Qualitative examples when generalizing to new objects, subjects, and scenes (backgrounds) on the H2O dataset.
  • Figure 5: The Exo2Ego framework generates ego videos with reasonable viewpoint changes.
  • ...and 9 more figures