Aria-NeRF: Multimodal Egocentric View Synthesis

Jiankai Sun; Jianing Qiu; Chuanyang Zheng; John Tucker; Javier Yu; Mac Schwager

TL;DR

A comprehensive multimodal egocentric video dataset is presented, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, connectivity details from Wi-Fi and Bluetooth, and dual-frequency IMU data paired with a magnetometer.

Abstract

We seek to accelerate research in developing rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs). The construction of a NeRF-like model from an egocentric image sequence plays a pivotal role in understanding human behavior and has diverse applications in VR/AR. Such egocentric NeRF-like models may be used as realistic simulations, contributing significantly to the advancement of intelligent agents capable of executing tasks in the real world. The future of egocentric view synthesis may lead to novel environment representations that go beyond today's NeRFs by augmenting visual data with multimodal sensors, such as IMUs for egomotion tracking, audio sensors to capture surface texture and human language context, and eye-gaze trackers to infer human attention patterns in the scene. To support and facilitate the development and evaluation of egocentric multimodal scene modeling, we present a comprehensive multimodal egocentric video dataset. The dataset features RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, connectivity details from Wi-Fi and Bluetooth, and dual-frequency IMU data (1 kHz and 800 Hz) paired with a magnetometer. The dataset was collected with the Meta Aria Glasses wearable device platform. The diverse data modalities and the real-world context captured within this dataset serve as a robust foundation for furthering our understanding of human behavior and enabling more immersive and intelligent experiences in VR, AR, and robotics.

Paper Structure (19 sections, 4 figures, 6 tables)

This paper contains 19 sections, 4 figures, 6 tables.

Introduction
Related Works
One/Few-shot NeRF
Dynamic NeRF
Multimodal NeRF
VR/AR
Method
Nerfacto
Ray Generation and Sampling
Scene Contraction and NeRF Field
NeuralDiff
Dataset and Benchmark
Data Collection
Experiments
Baselines and Implementation Details
...and 4 more sections

Figures (4)

Figure 1: The Kitchen 1 subset of the Aria-NeRF dataset comprises a diverse range of sensory data, including RGB images, eye-tracking (ET) camera footage, microphone audio, barometer, GPS, Wi-Fi, Bluetooth, and SLAM data, and two sets of IMU data (1 kHz and 800 Hz), along with magnetometer readings.
Figure 2: Scene Examples
Figure 3: Nerfacto Visualization Results on Kitchen 1 subset. We show the step numbers in the rendered video. Nerfacto results reveal some blurred regions, underscoring inherent limitations in its performance.
Figure 4: NeuralDiff Visualization Results on Kitchen 1 subset. NeuralDiff can disentangle the background, dynamic foreground, and actors, all achieved in an unsupervised manner.
