SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space

Zejian Kang; Kai Zheng; Yuanchen Fei; Wentao Yang; Hongyuan Zou; Xiangru Huang

Abstract

Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.

Paper Structure (33 sections, 6 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Face Modeling
Facial Action Estimation
Method
Problem Formulation
Semantic Supervision Signal Generation
Language-Prior Semantic Distillation
Experiments
Dataset
Data Acquisition.
Dataset Statistics and Diversity.
Data Processing.
Experimental Setting
Evaluation Metrics.
...and 18 more sections

Figures (7)

Figure 1: SemanticFace. Given a facial image, our model jointly outputs structured semantic predictions and interpretable facial action coefficients. This joint formulation bridges visual perception and semantic reasoning, yielding results that are both perceptually natural and semantically grounded, while demonstrating strong robustness across diverse identities and domain shifts (e.g., cartoon characters).
Figure 2: Overview of the SemanticFace framework. In Stage I, a frozen LLM serves as a semantic teacher, converting ground-truth ARKit blendshape coefficients into hierarchical semantic descriptions. In Stage II, these structured language-aligned representations are distilled into an MLLM acting as the student. Through language-prior semantic distillation, the student learns to predict ARKit coefficients from images under structured semantic guidance within an interpretable facial action space.
Figure 3: Emotion Distribution and Visualization. (a) t-SNE projection of scripts colored by emotion intensities. (b) Distribution of dominant emotions across all scripts.
Figure 4: Qualitative Comparison on the Test Set with Ground-Truth ARKit coefficients. From top to bottom: ground truth, DeadFace, and SemanticFace. Compared to the geometric baseline, our method produces facial actions that more closely match the ground-truth coefficients. The white numbers overlaid on each result denote the corresponding MSE values.
Figure 5: In-the-wild qualitative comparison of interpretable facial action estimation. SemanticFace maintains coherent and semantically consistent facial actions across diverse subjects and challenging conditions. The white numbers overlaid on each result denote the corresponding MMD values.
...and 2 more figures
