AoE: Always-on Egocentric Human Video Collection for Embodied AI

Bowen Yang; Zishuo Li; Yang Sun; Changtao Miao; Yifan Yang; Man Luo; Xiaotong Yan; Feng Jiang; Jinchuan Shi; Yankai Fu; Ning Chen; Junkai Zhao; Pengwei Wang; Guocai Yao; Shanghang Zhang; Hao Chen; Zhe Li; Kai Zhu

TL;DR

We propose the Always-on Egocentric (AoE) data collection system, which reduces hardware dependencies by relying on people and their own smartphones. It enables low-cost, efficient, scene-agnostic collection of real-world interaction data to address data scarcity.

Abstract

Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, which makes them hard to scale. Humans are themselves ideal physically embodied agents, so collecting egocentric real-world interaction data from globally distributed "human agents" is both low-cost and sustainable. To this end, we propose the Always-on Egocentric (AoE) data collection system, which reduces hardware dependencies by relying on people and their own smartphones, enabling low-cost, efficient, scene-agnostic collection of real-world interaction data to address data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile app that uses on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed egocentric video collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.

Paper Structure (34 sections, 9 figures, 2 tables)

Introduction
Related Works
Data Collection with In-The-Wild Equipment
Manipulation Policies Learning from Human Videos
Always-On Egocentric Data Collection System
Hardware & Mobile Application
Automated Annotation and Quality Filtering Pipeline
Distributed System Implementation
Experiments
Precision of the AoE System
Experimental Setup.
Real-to-Sim Transferability
Real-World Evaluation on Humanoid Hardware
System Configuration.
Task Suite.
...and 19 more sections

Figures (9)

Figure 1: Overview of the AoE system. The system leverages neck-mounted smartphones for ubiquitous egocentric capture (Left). Our edge-cloud collaborative pipeline (Middle) efficiently distributes computation: on-device models handle real-time detection and selective uploading, while cloud servers execute heavy-duty auto-labeling and quality filtering. This design minimizes hardware dependencies, enabling scalable, high-quality data collection in the wild (Right).
Figure 2: Overview of Hardware & Mobile Application. The AoE hardware supports various ergonomic mounts (Mechanical, MagSafe, Magnetic) with stabilizing straps for robust, all-day egocentric recording (Left). A user-friendly interface lets users manage recordings (Top Right). On-device intelligence selectively records high-value manipulation data (Bottom Right). A secure pipeline synchronizes user-authorized data to the cloud (Right).
Figure 3: Overview of the Automatic Annotation and Augmentation Pipeline. (a) Undistort videos and segment videos into atomic clips. (b) Dense depth maps yield camera trajectories and scene reconstruction. (c) Hand poses are generated and transformed to world coordinates. (d) Augmentation employs generative background replacement and simulation-based robot inpainting.
Figure 4: Distributed Edge-Cloud Architecture. Enabling low-latency edge-to-cloud synchronization, the system utilizes a configurable, elastically scaled pipeline to generate multi-modal data for robot policy learning.
Figure 5: Comparison of data acquisition methods and processing accuracy. (a) Depth camera acquisition configuration. (b) AR glasses + smartphone combined acquisition configuration. (c) Hand modeling comparison using AR glasses + smartphone. (d) Hand modeling comparison from EgoDex. (e) Camera trajectory comparison after rotation-translation alignment from EgoDex. (f) Hand reconstruction accuracy and camera trajectory reconstruction accuracy.
...and 4 more figures
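
The Figure 3 caption states that hand poses are generated and then transformed to world coordinates. As a rough illustration only, the sketch below applies a standard rigid transform, p_world = R * p_cam + t, to hand keypoints expressed in the camera frame. The function name `camera_to_world`, the pose values, and the keypoints are all hypothetical and are not taken from the paper; the actual AoE pipeline may represent poses differently.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(points_cam: List[Vec3], rotation: Mat3, translation: Vec3) -> List[List[float]]:
    """Map 3D points from the camera frame to the world frame: p_w = R * p_c + t.

    `rotation` is the 3x3 camera-to-world rotation and `translation` is the
    camera position in world coordinates (both assumed known, e.g. from an
    estimated camera trajectory).
    """
    world = []
    for p in points_cam:
        world.append([
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ])
    return world


# Hypothetical camera pose: rotated 90 degrees about the z-axis,
# located at (1, 0, 0) in world coordinates.
R_wc = [[0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]
t_wc = [1.0, 0.0, 0.0]

# Two hypothetical hand keypoints in camera coordinates (metres).
hand_cam = [[0.0, 0.0, 0.5], [0.1, 0.0, 0.5]]
hand_world = camera_to_world(hand_cam, R_wc, t_wc)
print(hand_world)
```

Applying the same per-frame pose to every keypoint of a hand places the whole hand trajectory in one shared world frame, which is what makes clips recorded from a moving camera comparable across frames.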
