Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe

TL;DR

Nymeria tackles the need for a large-scale, multimodal, in-the-wild egocentric motion dataset with ground-truth full-body motion and synchronized multi-device data. It provides 300 hours of daily activities from 264 participants across 50 locations, along with 301.5K sentences and 8.64M words describing motion at multiple granularities, with open-source data and code. The paper details hardware synchronization, data processing including full-body retargeting and global alignment, and in-context motion-language annotations. It also presents baselines for motion tracking/synthesis and language-grounded motion tasks, highlighting Nymeria's potential to advance egocentric perception, language-grounded control, and scene understanding.

Abstract

We introduce Nymeria, a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body ground-truth motion; b) multimodal egocentric data from multiple Project Aria devices, including video, eye tracking, and IMUs; and c) a third-person perspective from an additional observer. All devices are precisely synchronized and localized in one metric 3D world. We derive a hierarchical protocol to add in-context language descriptions of human motion, from fine-grained motion narration to simplified atomic actions and high-level activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest collection of human motion in the wild; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest motion-language dataset. It provides 300 hours of daily activities from 264 participants across 50 locations, with a total travelling distance of over 399 km. The language descriptions contain 301.5K sentences in 8.64M words from a vocabulary of 6545 words. To demonstrate the potential of the dataset, we evaluate several SOTA algorithms for egocentric body tracking, motion synthesis, and action recognition. Data and code are open-sourced for research (see https://www.projectaria.com/datasets/nymeria).

Paper Structure

This paper contains 61 sections, 2 equations, 17 figures, and 7 tables.

Figures (17)

  • Figure 1: A glimpse of the Nymeria dataset. The figure shows example indoor and outdoor activities captured on a campus, where the point clouds and trajectories are the SLAM output from tracking all egocentric devices, i.e., the glasses and wristbands. Each sub-figure is a motion clip from a different participant: the top left gives the latest egocentric view, the right shows the 3D localized full-body motion synchronized with the headset, and the bottom left provides an auxiliary third-person view.
  • Figure 2: Capture setup. (a) A fully dressed participant. (b) The hardware set, including Project Aria glasses, two miniAria wristbands, and a synchronization device. (c) The sensor suite of Project Aria and (d) the miniAria wristband.
  • Figure 3: Diverse scenarios by diverse people. We show different participants performing different indoor/outdoor activities at different locations. In each subfigure, we show an egocentric view on the top left, a third-person view on the bottom left, and motion rendering on the right.
  • Figure 4: Globally aligned trajectories and point clouds by location. We show examples of split-level residential houses with gardens, each containing $\approx$5 hours of recording. The left shows top-down views of the accumulated trajectories, where red, green, and blue indicate the head, the left wrist, and the right wrist, respectively. On the right, we sample close-up views where clusters emerge from the distribution of human 3D motion.
  • Figure 5: End-to-end quality assessment. We uniformly sample a 20 min recording covering 1.5 km of moving distance and project the skeleton into the observer's camera. The rendering and the image align well, owing to precise tracking and synchronization.
  • ...and 12 more figures