
Headset: Human emotion awareness under partial occlusions multimodal dataset

Fatemeh Ghorbani Lohesara, Davi Rabbouni Freitas, Christine Guillemot, Karen Eguiazarian, Sebastian Knorr

TL;DR

HEADSET is a multimodal volumetric dataset for immersive XR applications. It covers 27 participants performing posed facial expressions and 11 participants wearing HMDs, recorded with a volumetric capture (VoCap) studio and a Lytro Illum light field (LF) camera. The dataset includes textured meshes, colored point clouds, multi-view RGB-D, and light-field data, along with ground-truth labels for facial expressions and occlusion scenarios, all supported by post-processed data and calibration metadata. It supports evaluation of facial expression classification (FEC), HMD removal via deep inpainting, and 3D compression, with initial results showing competitive FEC performance and improved 3D quality after post-processing. By providing extensive raw and processed data, HEADSET aims to accelerate research in emotion-aware XR rendering, HMD-in-the-loop reconstruction, and volumetric video streaming, and it is intended to be publicly available for benchmarking and method development.

Abstract

The volumetric representation of human interactions is fundamental to the development of immersive media productions and telecommunication applications. With the rapid advancement of Extended Reality (XR) applications in particular, volumetric data has proven essential to future XR development. In this work, we present a new multimodal database to help advance immersive technologies. The database provides ethically compliant and diverse volumetric data: 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs). The recording system is a volumetric capture (VoCap) studio with 31 synchronized modules, comprising 62 RGB cameras and 31 depth cameras. In addition to textured meshes, point clouds, and multi-view RGB-D data, we simultaneously capture light field (LF) data with one Lytro Illum camera. Finally, we evaluate the dataset on the tasks of facial expression classification, HMD removal, and point cloud reconstruction. The dataset can support the evaluation and performance testing of various XR algorithms, including but not limited to facial expression recognition and reconstruction, facial reenactment, and volumetric video. HEADSET, all its associated raw data, and the license agreement will be publicly available for research purposes.
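
As a quick consistency check on the rig described above, the following minimal sketch models the VoCap studio as 31 modules and recomputes the camera totals. The per-module split (2 RGB cameras and 1 depth camera) is an assumption inferred from the stated totals, not something the abstract states, and the names `CaptureModule`, `build_rig`, and `camera_totals` are hypothetical and not part of any HEADSET tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_MODULES = 31  # stated in the abstract

# Assumption: every module carries the same cameras. The split below is
# inferred from the totals (62 RGB / 31 modules, 31 depth / 31 modules).
RGB_PER_MODULE = 2
DEPTH_PER_MODULE = 1


@dataclass(frozen=True)
class CaptureModule:
    """One synchronized VoCap module (hypothetical structure)."""
    module_id: int  # assumed 1-based numbering
    rgb_cameras: int = RGB_PER_MODULE
    depth_cameras: int = DEPTH_PER_MODULE


def build_rig(num_modules: int = NUM_MODULES) -> List[CaptureModule]:
    """Create one CaptureModule per module in the studio."""
    return [CaptureModule(module_id=i) for i in range(1, num_modules + 1)]


def camera_totals(rig: List[CaptureModule]) -> Tuple[int, int]:
    """Return (total RGB cameras, total depth cameras) across the rig."""
    rgb = sum(m.rgb_cameras for m in rig)
    depth = sum(m.depth_cameras for m in rig)
    return rgb, depth


rig = build_rig()
rgb_total, depth_total = camera_totals(rig)
print(f"{len(rig)} modules, {rgb_total} RGB cameras, {depth_total} depth cameras")
```

Under the assumed split, the totals match the 62 RGB and 31 depth cameras reported in the abstract.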

Paper Structure (28 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: The complete capturing setup for the data collection. In addition to the VoCap studio, we also used one Lytro Illum camera, one display, and a microphone.
  • Figure 2: Example of reconstructed textured meshes. (A): full-body 3D model of a participant, (B): RGB image captured by camera number 30, (C): full-body 3D model with HMD occlusion, and (D): RGB image captured by camera number 16.
  • Figure 3: Example of colored point clouds of a participant wearing glasses in task A. (A): RGB image captured by camera number 30, point cloud representation from (B): raw data, (C): post-processed, and (D): sampled from textured meshes.
  • Figure 4: Sample of RGB images and depth maps from three views. (A,D): RGB-D image of module number 30, (B,E): RGB-D image of module number 1, (C,F): RGB-D image of module number 16.
  • Figure 5: Three synchronized views (two non-frontal views in HEADSET-VoCap, and one frontal view in HEADSET-LF) of detected faces showing a "Happiness" expression.
  • ...and 3 more figures