Aria Everyday Activities Dataset

Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, Kiran Somasundaram, Luis Pesqueira, Mark Schwesinger, Omkar Parkhi, Qiao Gu, Renzo De Nardi, Shangyi Cheng, Steve Saarinen, Vijay Baiyya, Yuyang Zou, Richard Newcombe, Jakob Julian Engel, Xiaqing Pan, Carl Ren

TL;DR

The paper presents the Aria Everyday Activities (AEA) dataset, a 4D longitudinal, egocentric multimodal resource captured with Project Aria glasses to enable context-aware AI in daily life settings. It details the dataset's extensive sensor suite, machine perception outputs, precise 3D alignment, and time synchronization across devices, along with privacy safeguards and open-source tooling. The authors demonstrate two exemplar applications: 3D neural scene reconstruction using Gaussian Splatting and NeRFstudio, and prompted segmentation driven by eye gaze and speech prompts, underscoring the dataset's potential to advance persistent scene understanding and interactive AI. By providing rich multimodal data and accessible tools, AEA aims to catalyze research in longitudinal, contextually grounded AI for everyday activities.

Abstract

We present the Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor locations. Each recording contains multimodal sensor data recorded through the Project Aria glasses. In addition, AEA provides machine perception data including high-frequency, globally aligned 3D trajectories, scene point clouds, per-frame 3D eye gaze vectors and time-aligned speech transcription. In this paper, we demonstrate a few exemplar research applications enabled by this dataset, including neural scene reconstruction and prompted segmentation. AEA is an open-source dataset that can be downloaded from https://www.projectaria.com/datasets/aea/. We also provide open-source implementations and examples of how to use the dataset in Project Aria Tools: https://github.com/facebookresearch/projectaria_tools.
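
As a rough illustration of what time-aligned streams make possible, the sketch below pairs a transcribed word with the eye gaze sample closest to it in time. It is a minimal sketch under stated assumptions, not the Project Aria Tools API: the function name `nearest_index`, the nanosecond timestamps, and all sample values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def nearest_index(timestamps_ns, query_ns):
    """Return the index of the timestamp closest to query_ns.

    timestamps_ns must be non-empty and sorted in ascending order.
    """
    if not timestamps_ns:
        raise ValueError("timestamps_ns must not be empty")
    pos = bisect_left(timestamps_ns, query_ns)
    if pos == 0:
        return 0
    if pos == len(timestamps_ns):
        return len(timestamps_ns) - 1
    before, after = timestamps_ns[pos - 1], timestamps_ns[pos]
    # On a tie, prefer the earlier sample.
    return pos - 1 if query_ns - before <= after - query_ns else pos


# Hypothetical, made-up samples (not values taken from the dataset).
gaze_timestamps_ns = [0, 33_000_000, 66_000_000, 100_000_000]
gaze_vectors = [
    (0.0, 0.0, 1.0),
    (0.1, 0.0, 0.99),
    (0.2, 0.0, 0.98),
    (0.3, 0.0, 0.95),
]

# A transcribed word with a hypothetical start time.
word = {"text": "hello", "start_ns": 70_000_000}

i = nearest_index(gaze_timestamps_ns, word["start_ns"])
matched_gaze = gaze_vectors[i]
```

Nearest-neighbor lookup is only one possible association rule; interpolating between the two neighboring gaze samples would be an equally reasonable choice when the streams run at different rates.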

Paper Structure (39 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: An overview of the Aria Everyday Activities (AEA) dataset using some exemplar activities recorded in Location 1. In the right column, we highlight a time-synchronized snapshot of two wearers talking to each other in one activity, with the following information shown for each wearer in red and green: (1) their high-frequency 6DoF closed-loop trajectories, (2) observed point cloud, (3) RGB camera view frustum, (4) monochrome scene cameras, (5) eye-tracking cameras, (6) their projected eye gaze on all three camera streams and (7) transcribed speech. On the left side, we also highlight a diverse set of activities (e.g. dining, doing laundry, folding clothes, cooking) with the projected eye gaze (green dot) on the RGB streams. All of the recordings contain closed-loop trajectories (white lines) spatially aligned on the environment point cloud (semi-dense points).
  • Figure 2: An overview of the components on the Project Aria device, with an exemplar sensor configuration of each. In this dataset, we use Profile9, which includes all the sensors of the device except WiFi, Bluetooth and GNSS. We visualize a snapshot of all the sensors in one of the recordings on the right.
  • Figure 3: A visualization of the shared 3D global closed-loop trajectories and the semi-dense point clouds for multi-recording activities in each of the 5 locations. Each color indicates the high-frequency trajectory of one sequence recorded in this location. In each location, the point clouds are aggregated from all recordings. Locations 3 and 5 are shown sideways to highlight the multi-floor scenarios.
  • Figure 4: We manually blurred all the human faces in both the RGB color video (right) and the two monochrome scene camera videos (cropped in the left two images).
  • Figure 5: A snapshot of the AEA dataset viewer (multiple-person activity) showing the synchronized rectified RGB streams with their device trajectories (dark red and green, one per device), eye gaze vector direction (a red vector from each device frustum), the projected eye gaze on the RGB image (red dot in each image), the aggregated 3D semi-dense point cloud for each recording (white) and the transcribed speech sentences (overlaid on each RGB image). We provide this viewer in Project Aria Tools.
  • ...and 3 more figures