Egocentric and Exocentric Methods: A Short Survey
Anirudh Thatipelli, Shao-Yuan Lo, Amit K. Roy-Chowdhury
TL;DR
The paper surveys egocentric and exocentric learning, arguing that joint modeling of first-person and third-person views is crucial for advancing AI agents in vision tasks. It provides a structured taxonomy of methods (perception, cross-view learning, and generation) and reviews a growing set of paired ego-exo datasets, highlighting their modalities, synchronization, and annotations. Key contributions include a review of work on identification, action recognition, tracking, synthesis, affordance analysis, transfer, and joint representation learning across views, along with discussions of current challenges and future directions. The work underscores the potential of large-scale, multi-modal ego-exo data and cross-view learning to improve robustness and generalization in real-world applications such as robotics, AR/VR, and human-centered AI systems.
Abstract
Egocentric vision captures the scene from the point of view of the camera wearer, while exocentric vision captures the overall scene context. Jointly modeling ego and exo views is crucial to developing next-generation AI agents. The community has regained interest in egocentric vision. While the third-person and first-person views have each been thoroughly investigated, very few works study both jointly. Exocentric videos contain many relevant signals that are transferable to egocentric videos. This paper provides a timely overview of works combining egocentric and exocentric vision, a very new but promising research topic. We describe the datasets in detail and survey the key applications of ego-exo joint learning, identifying the most recent advances. We believe this short but timely survey, which presents the current state of progress, will be valuable to the broad video-understanding community, particularly where multi-view modeling is critical.
