EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

Junho Park, Andrew Sangwoo Ye, Taein Kwon

TL;DR

This work introduces EgoWorld, a novel framework that reconstructs an egocentric view from rich exocentric observations, including point clouds, 3D hand poses, and textual descriptions, and exhibits robustness on in-the-wild examples, underscoring its practical applicability.

Abstract

Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR), and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as the necessity of an initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel framework that reconstructs an egocentric view from rich exocentric observations, including point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies a diffusion model to produce dense, semantically coherent egocentric images. Evaluated on four datasets (H2O, TACO, Assembly101, and Ego-Exo4D), EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld exhibits robustness on in-the-wild examples, underscoring its practical applicability. The project page is available at https://redorangeyellowy.github.io/EgoWorld/.
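The geometric step named in the abstract (lift an exocentric depth map to a point cloud, then reproject it into the egocentric view) can be sketched as follows. This is a minimal illustration under a pinhole camera model, not the authors' implementation: the function names, the intrinsics `K` and `K_ego`, and the 4x4 transform `T_exo_to_ego` are hypothetical inputs introduced here, and how EgoWorld actually obtains the egocentric camera pose is not covered by this summary. The output is a sparse image plus a validity mask; the holes are what a generative model would then need to fill.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map (H, W) to an (H*W, 3) point cloud in the camera frame.

    Assumes a pinhole camera with 3x3 intrinsics K (hypothetical input).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-project pixels to unit-depth rays
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def reproject(points_exo, colors, T_exo_to_ego, K_ego, out_hw):
    """Project coloured exo-frame points into an ego camera.

    T_exo_to_ego is an assumed 4x4 rigid transform; it is simply given here.
    Returns a sparse (H, W, 3) image and a boolean mask of filled pixels.
    """
    h, w = out_hw
    pts_h = np.concatenate([points_exo, np.ones((len(points_exo), 1))], axis=1)
    pts_ego = (pts_h @ T_exo_to_ego.T)[:, :3]
    z = pts_ego[:, 2]
    front = z > 1e-6                         # keep points in front of the camera
    pts_ego, colors, z = pts_ego[front], colors[front], z[front]
    uv = pts_ego @ K_ego.T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]
    # Simple z-buffer: write far points first so nearer points overwrite them.
    order = np.argsort(-z, kind="stable")
    image = np.zeros((h, w, 3), dtype=colors.dtype)
    mask = np.zeros((h, w), dtype=bool)
    image[v[order], u[order]] = colors[order]
    mask[v, u] = True
    return image, mask
```

With an identity transform and identical intrinsics, the reprojection reproduces the input image exactly; with a real viewpoint change, many target pixels receive no point, which is why the abstract follows this step with a diffusion model to produce a dense image.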

Paper Structure

This paper contains 33 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: EgoWorld translates a single exocentric view into an egocentric view. By leveraging rich multi-modal exocentric observations, such as point clouds, 3D hand poses, and textual descriptions, EgoWorld is able to generate high-quality egocentric views, even in unseen scenarios. Each observed modality provides complementary information that contributes to the accurate and realistic reconstruction of the egocentric view.
  • Figure 2: Overall framework of EgoWorld. EgoWorld has a two-stage pipeline: (1) exocentric view observation $\Phi_{exo}$, which extracts diverse observations from the exocentric view, including projected point clouds, 3D hand poses, and textual descriptions; and (2) egocentric view reconstruction $\Phi_{ego}$, which reconstructs the egocentric view based on cues from the exocentric view observation.
  • Figure 3: Comparisons with state-of-the-art methods on unseen scenarios (i.e., objects, actions, scenes, and subjects) in H2O [kwon2021h2o]. Compared to state-of-the-art methods (i.e., pix2pixHD [wang2018high], pixelNeRF [yu2021pixelnerf], and CFLD [lu2024coarse]), EgoWorld achieves better image reconstruction quality in hand-object interaction and background regions across all unseen scenarios.
  • Figure 4: Comparisons with the state of the art on the unseen-actions scenario in TACO [liu2024taco], Assembly101 [sener2022assembly101], and Ego-Exo4D [grauman2024ego]. Compared to the state of the art (i.e., CFLD [lu2024coarse]), EgoWorld achieves better image reconstruction quality in hand-object interaction and background regions, even on scenarios more challenging than H2O [kwon2021h2o].
  • Figure 5: Real-world comparisons with the state of the art. Compared to the state of the art (i.e., CFLD [lu2024coarse]), EgoWorld performs significantly better in hand-object interaction and background regions for in-the-wild scenarios.
  • ...and 7 more figures