AUGlasses: Continuous Action Unit based Facial Reconstruction with Low-power IMUs on Smart Glasses
Yanrong Li, Tengxiang Zhang, Xin Zeng, Yuntao Wang, Haotian Zhang, Yiqiang Chen
TL;DR
AUGlasses tackles the challenge of continuous, privacy-preserving facial reconstruction on low-power smart glasses by embedding two IMUs near the temporal region to sense skin deformations and using a transformer-based model to estimate the intensities of 14 facial action units (AUs) at 30 Hz. The system combines a CNN feature extractor with a six-layer encoder and six-layer decoder that uses prefix-conditioned sequence forecasting to mitigate exposure bias and improve long-horizon accuracy; the AU predictions drive a Unity-based 3D avatar through real-time blendshape mapping. It demonstrates cross-user generalization with AU mean absolute errors (MAE) of around 0.19–0.21 across the 14 AUs and a 3D landmark MAE near 1.93 mm, and micro-benchmarks show low power consumption (~49.95 mW) and comfortable wearability. These results indicate a practical, privacy-conscious pathway to continuous facial tracking for AR applications, with robust long-term performance and clear avenues for personalization and future material improvements.
Abstract
Recent advances in augmented reality (AR) have enabled the use of various sensors on smart glasses for applications such as facial reconstruction, which is vital for improving AR experiences in virtual social activities. However, the size and power constraints of smart glasses demand a miniature, low-power sensing solution. AUGlasses achieves unobtrusive, low-power facial reconstruction by placing inertial measurement units (IMUs) against the temporal area of the face to capture skin deformations caused by facial muscle movements. These IMU signals, along with historical data on facial action units (AUs), are processed by a transformer-based deep learning model to estimate AU intensities in real time, which are then used for facial reconstruction. Our results show that AUGlasses accurately predicts the strength (0–5 scale) of 14 key AUs with a cross-user mean absolute error (MAE) of 0.187 (STD = 0.025) and achieves facial reconstruction with a cross-user MAE of 1.93 mm (STD = 0.353). We also integrated various preprocessing and training techniques to ensure robust performance for continuous sensing. Micro-benchmark tests indicate that our system consistently performs accurate continuous facial reconstruction with a fine-tuned cross-user model, achieving an AU MAE of 0.35.
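To make the reported metric concrete, the following is a minimal sketch of how a cross-user AU MAE and its standard deviation could be computed. It is an illustration only, not the authors' evaluation code: the function names, the toy data, the choice to average per-user MAEs, and the use of the population standard deviation are all assumptions, since the abstract does not specify these details.

```python
from statistics import mean, pstdev

def mae(pred, truth):
    """Mean absolute error over all frames and all AUs for one user."""
    errors = [abs(p - t)
              for pred_frame, truth_frame in zip(pred, truth)
              for p, t in zip(pred_frame, truth_frame)]
    return mean(errors)

def cross_user_mae(per_user):
    """Average the per-user MAEs and report their spread.

    per_user maps a (hypothetical) user id to a pair
    (predicted frames, ground-truth frames); each frame is a list
    of AU intensities on the 0-5 scale.
    Assumption: population standard deviation across users.
    """
    scores = [mae(pred, truth) for pred, truth in per_user.values()]
    return mean(scores), pstdev(scores)

# Toy data: two hypothetical users, two frames each, three AUs per frame.
toy = {
    "user_a": ([[1.0, 2.0, 0.0], [3.0, 1.0, 0.5]],
               [[1.0, 2.5, 0.0], [2.5, 1.0, 0.0]]),
    "user_b": ([[0.0, 4.0, 2.0], [1.0, 1.0, 1.0]],
               [[0.0, 4.0, 2.0], [1.0, 1.0, 1.0]]),
}
avg, spread = cross_user_mae(toy)
print(f"cross-user MAE = {avg:.3f} (STD = {spread:.3f})")
```

In a cross-user setting, each user's predictions would come from a model that never saw that user's data during training; the sketch only covers the aggregation step.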
