Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement Units

Youjia Wang, Yiwen Wu, Hengan Zhou, Hongyang Lin, Xingyue Peng, Jingyan Zhang, Yingsheng Zhu, Yingwenqi Jiang, Yatu Zhang, Lan Xu, Jingya Wang, Jingyi Yu

TL;DR

CAPUS addresses the challenge of facial motion capture without visual input by using a set of lightweight, anatomy-aligned IMUs placed on facial muscles. It introduces a new IMU design, the first facial IMU dataset aligned with ARKit, and a Transformer Diffusion-based model that maps IMU signals to 53 blendshape parameters. The results show CAPUS can reliably reconstruct facial expressions in occluded, low-light, and mobile scenarios and offers strong privacy advantages over vision-based methods. This work suggests that IMU-based facial MoCap can reach performance comparable to visual approaches while enabling camera-free, privacy-preserving capture in challenging environments.

Abstract

We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where vision-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for vision-free facial MoCap.
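To make the idea of inferring blendshape parameters from IMU data with a diffusion model more concrete, here is a minimal sketch of a conditional denoising loop. It is not the authors' implementation. The linear noise schedule, the step count, the IMU feature size, the choice to have the network predict the clean sample directly, and the `[0, 1]` output range are all assumptions, and `toy_denoiser` is a hypothetical stand-in for the trained Transformer. Only the 53 blendshape outputs and the use of IMU signals as the condition come from the summary above.

```python
import numpy as np

NUM_BLENDSHAPES = 53  # stated in the TL;DR
NUM_STEPS = 50        # assumption: number of diffusion steps
IMU_FEATURES = 24     # assumption: flattened IMU channels per frame


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the summary does not name one)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    # Cumulative product at the previous step, defined as 1 before the first step.
    alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))
    return betas, alpha_bars, alpha_bars_prev


def toy_denoiser(x_t, t, imu):
    """Hypothetical stand-in for the trained Transformer.

    Ignores x_t and t, and maps the IMU condition through a fixed random
    linear layer followed by a sigmoid, giving values in (0, 1).
    """
    rng = np.random.default_rng(0)
    weights = 0.1 * rng.standard_normal((imu.shape[-1], NUM_BLENDSHAPES))
    return 1.0 / (1.0 + np.exp(-(imu @ weights)))


def sample_blendshapes(imu, denoiser=toy_denoiser, num_steps=NUM_STEPS, seed=0):
    """Run the reverse diffusion process conditioned on IMU features.

    imu: array of shape (frames, IMU_FEATURES).
    Returns an array of shape (frames, NUM_BLENDSHAPES) clipped to [0, 1].
    """
    rng = np.random.default_rng(seed)
    betas, alpha_bars, alpha_bars_prev = make_schedule(num_steps)
    x = rng.standard_normal((imu.shape[0], NUM_BLENDSHAPES))  # start from pure noise
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t, imu)  # network's estimate of the clean sample
        # Mean and variance of the DDPM posterior q(x^{t-1} | x^t, x^0).
        coef_x0 = np.sqrt(alpha_bars_prev[t]) * betas[t] / (1.0 - alpha_bars[t])
        coef_xt = np.sqrt(1.0 - betas[t]) * (1.0 - alpha_bars_prev[t]) / (1.0 - alpha_bars[t])
        variance = betas[t] * (1.0 - alpha_bars_prev[t]) / (1.0 - alpha_bars[t])
        noise = rng.standard_normal(x.shape)
        # The variance is exactly zero at the last step (t == 0), so no noise is added there.
        x = coef_x0 * x0_hat + coef_xt * x + np.sqrt(variance) * noise
    return np.clip(x, 0.0, 1.0)  # assumption: coefficients live in [0, 1]


# Usage: 8 frames of synthetic IMU features, not real sensor data.
imu_window = np.random.default_rng(1).standard_normal((8, IMU_FEATURES))
blendshapes = sample_blendshapes(imu_window)
print(blendshapes.shape)  # (8, 53)
```

In a real system the denoiser would be a trained network that attends over a window of IMU frames. The stand-in only keeps the loop runnable so that the role of the condition and the step-by-step denoising are visible.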

Paper Structure (15 sections, 3 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: We introduce CAPUS, an innovative facial capture system based on IMUs. Using flexible electronic materials, we fabricate miniature IMUs that attach to the human face. Without relying on any visual signals, CAPUS can accurately reconstruct facial expressions.
  • Figure 2: Our IMU has two main components: the face unit and the primary unit. Top: size comparison. Bottom: architecture design.
  • Figure 3: Our transformer diffusion network architecture. We use IMU signal $C$ as a condition input to the network. In each iteration, the network denoises $x^t$, and finally outputs the predicted blendshape parameters $x^0$.
  • Figure 4: Gallery. We present three subjects, with each row corresponding to two different expressions of a single participant. For each subfigure, Left: Image reference. Middle: Facial motion reconstructed by our pipeline. Right: Result recorded by ARKit [apple2023]. Our method achieves results that are comparable to those obtained using ARKit.
  • Figure 5: Experiment on IMU placement on the face. This figure presents our anatomically based facial partitioning, highlighting the selected points and the experiments conducted for each facial region. The left image shows our chosen points on the face, while the other images detail the experiments conducted for each area. In each of these, the upper section presents a distribution map of the test points allocated to the region, the middle section identifies the primary expressions and movements associated with that area, and the lower section shows the acceleration curves of the IMUs placed at each designated point.
  • ...and 3 more figures