Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

Haolin Xiong, Tianwen Fu, Pratusha Bhuvana Prasad, Yunxuan Cai, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao

TL;DR

Mind-to-Face introduces the first framework that converts non-invasive EEG signals into photorealistic 3D facial expressions by mapping EEG windows to dense 3D position maps and rendering them with 3D Gaussian Splatting. The approach relies on a synchronized dual-modality dataset of 16-channel EEG and high-speed multi-view video, using a CNN–Transformer encoder to produce dense geometry supervised by photogrammetric ground truth. It emphasizes personalized neural-to-expression mappings to capture subject-specific dynamics and demonstrates high-fidelity, view-consistent avatars even under occlusion. This work unlocks emotion-aware telepresence and cognitive interaction by leveraging neural activity to drive realistic facial synthesis without visible facial recordings.

Abstract

Current expressive avatar systems rely heavily on visual cues, failing when faces are occluded or when emotions remain internal. We present Mind-to-Face, the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. We build a dual-modality recording setup to obtain synchronized EEG and multi-view facial video during emotion-eliciting stimuli, enabling precise supervision for neural-to-visual learning. Our model uses a CNN-Transformer encoder to map EEG signals into dense 3D position maps, capable of sampling over 65k vertices, capturing fine-scale geometry and subtle emotional dynamics, and renders them through a modified 3D Gaussian Splatting pipeline for photorealistic, view-consistent results. Through extensive evaluation, we show that EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, demonstrating that neural signals contain far richer affective and geometric information than previously assumed. Mind-to-Face establishes a new paradigm for neural-driven avatars, enabling personalized, emotion-aware telepresence and cognitive interaction in immersive environments.
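To make the "dense 3D position map" representation concrete, here is a minimal sketch of how such a map could be turned into a mesh. It assumes a position map is a regular grid whose cells each store one (x, y, z) sample, and that a 256 x 256 grid is what yields the "over 65k vertices" figure. Neither the resolution nor the triangulation scheme is stated in the abstract, and the function name `position_map_to_mesh` is hypothetical.

```python
def position_map_to_mesh(pos_map):
    """Convert an H x W grid of (x, y, z) samples into vertices and triangles.

    Assumption: neighbouring grid cells are neighbours on the surface, so each
    grid quad can be split into two triangles. The paper's actual resampling
    step may differ.
    """
    h, w = len(pos_map), len(pos_map[0])
    # One vertex per grid cell, stored in row-major order.
    vertices = [pos_map[r][c] for r in range(h) for c in range(w)]
    faces = []
    for r in range(h - 1):
        for c in range(w - 1):
            i = r * w + c  # index of the quad's top-left vertex
            faces.append((i, i + 1, i + w))
            faces.append((i + 1, i + w + 1, i + w))
    return vertices, faces


# Assumed resolution: a 256 x 256 map holds 65,536 samples.
ASSUMED_RESOLUTION = 256
print(ASSUMED_RESOLUTION * ASSUMED_RESOLUTION)

# Tiny 2 x 2 example: four vertices, two triangles.
tiny_map = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.5)],
]
tiny_vertices, tiny_faces = position_map_to_mesh(tiny_map)
print(len(tiny_vertices), len(tiny_faces))
```

Because every grid cell has a fixed location in the map, a mesh recovered this way keeps the same vertex ordering from frame to frame, which is one plausible reason an image-like output is convenient for a per-frame decoder.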

Paper Structure

This paper contains 32 sections, 11 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: Overview of Mind-to-Face, our neural-driven avatar framework. (a) We record synchronized EEG and facial expressions while subjects view emotion-eliciting stimuli, then decode the EEG into dense 3D position maps, and render photorealistic avatars using 3D Gaussian Splatting (Kerbl et al., 2023). (b) Although AR/VR is not the primary focus of this work, a potential use case is shown: EEG can be decoded to drive an expressive avatar even when the face is fully occluded by a head-mounted display.
  • Figure 2: Overview of Our Mind-to-Face Pipeline. Our system decodes raw EEG signals into dense 3D position maps that are rendered into photorealistic avatars. Data are collected using a custom multi-view capture rig composed of synchronized high-speed RGB cameras and a 16-channel EEG headset, with all modalities temporally aligned via Linear Timecode for frame-accurate correspondence between EEG and facial expressions. Multi-view videos are reconstructed into ground-truth 3D facial meshes through photogrammetry. During training, EEG slices are encoded using a CNN–Transformer encoder (EEG-Conformer; Song et al., 2023) and decoded into 3D position maps using a Stable Diffusion 2.1 image decoder (Rombach et al., 2021), supervised with MSE loss against photogrammetric position maps. At inference, the predicted position maps are resampled into meshes and rendered using a modified GaussianAvatars pipeline (Qian et al., 2024) for high-fidelity avatar synthesis.
  • Figure 3: Trial Design. The experiment consists of five emotion-specific trials, each separated by a 30-second rest period. Within each trial, participants view a sequence of video clips conveying a single, consistent emotion corresponding to that trial’s theme.
  • Figure 4: Illustration of the Gaussian Binding Strategy. Each ellipsoidal Gaussian splat is defined by its local rotation $\mathbf{r}$, position $\boldsymbol{\mu}$, and scale $\mathbf{s}$ relative to the face center with global transform $\mathbf{T}$ and $\mathbf{R}$. During animation, $\mathbf{T}$ and $\mathbf{R}$ are updated over time according to the predicted facial geometry, while $\mathbf{r}$, $\boldsymbol{\mu}$, and $\mathbf{s}$ remain fixed as learnable parameters. The mesh is decimated for clarity in visualization.
  • Figure 5: Qualitative results of EEG-driven facial expression synthesis on the test set. For three stimulus categories (Sad, Funny, Disgust), we show the emotion-eliciting video frames (top), the captured facial expressions from two subjects (middle), and the corresponding expressions produced by our EEG-driven avatar (bottom). The reconstructed avatars reflect subject-specific emotional responses and capture both neutral and strongly expressive frames. More qualitative results and video demos can be found in the Supplementary Materials.
  • ...and 6 more figures
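
The Gaussian binding described in the Figure 4 caption can be illustrated with a small sketch. It assumes that $\mathbf{T}$ is a translation vector and $\mathbf{R}$ a 3x3 rotation matrix, that a splat's world position is $\mathbf{R}\boldsymbol{\mu} + \mathbf{T}$, that its world rotation is $\mathbf{R}\mathbf{r}$ with $\mathbf{r}$ also held as a matrix, and that the scale passes through unchanged. The caption does not give these formulas, so this is one plausible reading and not the paper's implementation; all function names are hypothetical.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def bind_gaussian(local_mu, local_r, local_s, global_R, global_T):
    """Map one splat from its local frame into world space.

    Assumed formulas (not taken from the paper):
      world position = R @ mu + T
      world rotation = R @ r
      world scale    = s (unchanged)
    """
    rotated = mat_vec(global_R, local_mu)
    world_mu = [p + t for p, t in zip(rotated, global_T)]
    world_r = mat_mul(global_R, local_r)
    return world_mu, world_r, local_s


# The local parameters stay fixed; only R and T change from frame to frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
local_mu, local_s = [1.0, 0.0, 0.0], [0.1, 0.1, 0.1]

frame_transforms = [
    (identity, [0.0, 0.0, 0.0]),           # rest pose
    (rot_z(math.pi / 2), [0.0, 0.0, 2.0]),  # rotated and lifted
]
for global_R, global_T in frame_transforms:
    world_mu, _, _ = bind_gaussian(local_mu, identity, local_s, global_R, global_T)
    print([round(x, 6) for x in world_mu])
```

Under this reading, animating the avatar only requires recomputing $\mathbf{T}$ and $\mathbf{R}$ from the predicted geometry at each frame, while the per-splat parameters are reused as they are.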