Egocentric and Exocentric Methods: A Short Survey

Anirudh Thatipelli, Shao-Yuan Lo, Amit K. Roy-Chowdhury

TL;DR

The paper surveys egocentric and exocentric learning, arguing that joint modeling of first-person and third-person views is crucial for advancing AI agents in vision tasks. It organizes methods into a three-part taxonomy (perception, cross-view learning, and generation) and reviews a growing set of paired ego-exo datasets, comparing their modalities, synchronization, and annotations. Key contributions include a review of work on identification, action recognition, tracking, synthesis, affordance analysis, transfer, and joint representation learning across views, along with a discussion of current challenges and future directions. The work underscores the potential of large-scale, multi-modal ego-exo data and cross-view learning to improve robustness and generalization in real-world applications such as robotics, AR/VR, and human-centered AI systems.

Abstract

Egocentric vision captures the scene from the point of view of the camera wearer, while exocentric vision captures the overall scene context. Jointly modeling ego and exo views is crucial to developing next-generation AI agents. The community has regained interest in egocentric vision. While the third-person and first-person views have each been thoroughly investigated, very few works study both synchronously. Exocentric videos contain many relevant signals that are transferable to egocentric videos. This paper provides a timely overview of works combining egocentric and exocentric vision, a very new but promising research topic. We describe the datasets in detail and survey the key applications of ego-exo joint learning, identifying the most recent advances. By presenting the current state of progress, we believe this short but timely survey will be valuable to the broad video-understanding community, particularly where multi-view modeling is critical.

Paper Structure

This paper contains 12 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Hand-object interactions in the third-person view (right) are useful for identifying the action from the first-person viewpoint (left).
  • Figure 2: Ego-Exo datasets and corresponding tasks. This figure illustrates the different ego-exo datasets in the literature and compares them to the associated benchmarks. The newly released Ego-Exo4D [grauman2024ego], EgoExoLearn [huang2024egoexolearn], and EgoExoFitness [li2024egoexofitnessegocentricexocentricfullbody] introduce a large suite of novel tasks to further research in this area.
  • Figure 3: Categorization of the joint egocentric and exocentric tasks. These tasks can be grouped into three major groups: Perception Tasks, Cross-View Learning and Generation. Perception Tasks include identification, tracking, action recognition, and affordance analysis, while Cross-View Learning encompasses ego-exo transfer and joint ego-exo works.
  • Figure 4: A modular framework for identification. A feature extraction module (graph/spatial) captures representations from the image inputs, which are then fed into a matching module for identification. Images adapted from [ardeshir2016ego2top].
  • Figure 5: A unified framework for joint egocentric-exocentric action recognition. A joint multi-view representation space is learned from the egocentric and exocentric views. At inference, the model uses this learned representation to make predictions on egocentric video, leveraging knowledge from exocentric views for improved recognition. Images taken from [damen2018scaling] and [kay2017kinetics].
  • ...and 5 more figures