SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space

Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou, Xiangru Huang

Abstract

Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.

Paper Structure (33 sections, 6 equations, 7 figures, 7 tables, 1 algorithm)

Figures (7)

  • Figure 1: SemanticFace. Given a facial image, our model jointly outputs structured semantic predictions and interpretable facial action coefficients. This joint formulation bridges visual perception and semantic reasoning, yielding results that are both perceptually natural and semantically grounded, while demonstrating strong robustness across diverse identities and domain shifts (e.g., cartoon characters).
  • Figure 2: Overview of the SemanticFace framework. In Stage I, a frozen LLM serves as a semantic teacher, converting ground-truth ARKit blendshape coefficients into hierarchical semantic descriptions. In Stage II, these structured, language-aligned representations are distilled into a multimodal large language model (MLLM) acting as the student. Through language-prior semantic distillation, the student learns to predict ARKit coefficients from images under structured semantic guidance within an interpretable facial action space.
  • Figure 3: Emotion Distribution and Visualization. (a) t-SNE projection of scripts colored by emotion intensities. (b) Distribution of dominant emotions across all scripts.
  • Figure 4: Qualitative Comparison on the Test Set with Ground-Truth ARKit coefficients. From top to bottom: ground truth, DeadFace, and SemanticFace. Compared to the geometric baseline, our method produces facial actions that more closely match the ground-truth coefficients. The white numbers overlaid on each result denote the corresponding MSE values.
  • Figure 5: In-the-wild qualitative comparison of interpretable facial action estimation. SemanticFace maintains coherent and semantically consistent facial actions across diverse subjects and challenging conditions. The white numbers overlaid on each result denote the corresponding MMD values.
  • ...and 2 more figures
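
The Figure 4 caption reports a per-result MSE between predicted and ground-truth ARKit coefficients. As a rough illustration of what such a number measures, the sketch below computes a plain mean squared error over one coefficient vector. It assumes the error is averaged uniformly over all 52 ARKit blendshape coefficients; the source does not state the exact averaging used, and the function name `coefficient_mse` and the example values are hypothetical.

```python
from typing import Sequence

# ARKit face tracking exposes 52 blendshape coefficients, each in [0, 1].
NUM_ARKIT_BLENDSHAPES = 52


def coefficient_mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two blendshape coefficient vectors.

    Assumption: every coefficient is weighted equally; the paper may
    aggregate differently (e.g. per region or per frame).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("coefficient vectors must not be empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical example: the first coefficient is fully active in the
# ground truth, but the prediction only reaches half of that activation.
target = [0.0] * NUM_ARKIT_BLENDSHAPES
target[0] = 1.0
pred = [0.0] * NUM_ARKIT_BLENDSHAPES
pred[0] = 0.5

print(coefficient_mse(pred, target))
```

Because the error is averaged over all coefficients, a large miss on a single blendshape is diluted by the many coefficients that agree, which is one reason a low MSE alone does not guarantee a perceptually faithful expression.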