EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Amir Bar, Arya Bakhtiar, Danny Tran, Antonio Loquercio, Jathushan Rajasegaran, Yann LeCun, Amir Globerson, Trevor Darrell

TL;DR

EgoPet introduces a large-scale egocentric animal video dataset (~$84$ hours) with egomotion and interaction data across diverse species. It defines three benchmarks: Visual Interaction Prediction (VIP), Locomotion Prediction (LP), which anticipates the animal's forward trajectory, and Vision to Proprioception Prediction (VPP), which tests vision-to-proprioception transfer to legged robot locomotion. Models pretrained on EgoPet outperform those trained on prior video datasets, especially on robotics-oriented tasks. The work shows EgoPet's potential to narrow the gap between animals' perception and action abilities and current AI capabilities, while revealing that interaction prediction remains substantially challenging. The dataset provides a foundation for self-supervised learning and robotics research, with future directions including multi-sensory integration (e.g., audio) to better model animal behavior.

Abstract

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion, showing that models trained from EgoPet outperform those trained from prior datasets.

Paper Structure (17 sections, 9 figures, 5 tables)

Figures (9)

  • Figure 1: We present EgoPet, a novel animal egocentric video dataset to advance learning animal-like behavior models from video (top row). We propose three benchmark tasks on this dataset (bottom row). Visual Interaction Prediction (VIP) and Locomotion Prediction (LP) are designed to predict animals' perception and action behavior. Finally, Vision to Proprioception Prediction (VPP) studies the utility of our dataset on the downstream task of robot locomotion in the wild. For all tasks, we find that models trained on EgoPet outperform those trained on previously available video datasets.
  • Figure 2: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.
  • Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than $30$ seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.
  • Figure 4: Visual Interaction Prediction (VIP) task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal's environment.
  • Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.
  • ...and 4 more figures