
PixelSmile: Toward Fine-Grained Facial Expression Editing

Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang

Abstract

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

Paper Structure

This paper contains 34 sections, 11 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: Overview of PixelSmile. It enables 1) continuous and precise control of facial expression intensity across real-world and anime domains, 2) editing across 12 distinct expression categories, and 3) seamless blending of multiple expressions.
  • Figure 2: Observation of Expression Semantic Overlap. Inherent expression overlap causes systematic confusion across human annotators, recognition models, and generative models (top). We resolve this via the FFE dataset (bottom left) and PixelSmile framework (bottom right), utilizing continuous supervision and symmetric training for disentangled editing.
  • Figure 3: Framework Overview. (1) Inference Stage. We interpolate between the neutral and target expression embeddings in textual latent space using a controllable coefficient $\alpha$, enabling continuous adjustment of expression intensity. (2) Training Stage. We adopt a joint fully symmetric training framework. Specifically, we sample a source image $P_{\mathrm{src}}$ and a confusing expression pair $(P_a, P_b)$ to construct a triplet. We first treat $P_a$ as the positive and $P_b$ as the negative to compute a joint loss, and then swap their roles to compute it again, yielding a symmetric training objective. The joint loss consists of three components: a Flow-Matching loss for intensity alignment, a contrastive loss for expression separation, and an identity preservation loss to maintain subject consistency.
  • Figure 4: Quantitative Evaluation of Linear Control Methods. Comparison of the trade-off between ID similarity and expression score across different models. PixelSmile achieves an optimal balance, providing a wider expression manipulation range while preserving identity fidelity.
  • Figure 5: Qualitative Comparison with General Editing Models. PixelSmile produces clearer expression changes while preserving facial identity, whereas existing editing models either weaken expression editing or degrade identity consistency.
  • ...and 9 more figures
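As a rough illustration of the inference stage described in Figure 3, the textual latent interpolation can be sketched as a linear blend between a neutral embedding and a target expression embedding, controlled by the coefficient $\alpha$. This is a minimal sketch under stated assumptions: the function name and the plain-list embeddings are hypothetical, and the real system operates on diffusion-model text latents rather than Python lists.

```python
def interpolate_expression(neutral_emb, target_emb, alpha):
    """Linearly blend two text embeddings element-wise.

    alpha = 0.0 -> neutral expression, alpha = 1.0 -> full target
    expression; intermediate values give continuous intensity control.
    (Illustrative sketch only; not the PixelSmile implementation.)
    """
    return [(1.0 - alpha) * n + alpha * t
            for n, t in zip(neutral_emb, target_emb)]

# Toy 4-dimensional embeddings standing in for real text latents.
neutral = [0.0, 0.2, 0.0, 0.1]
smile   = [1.0, 0.8, 0.4, 0.1]

# Halfway between neutral and the target expression.
half_smile = interpolate_expression(neutral, smile, 0.5)
```

Sweeping `alpha` from 0 to 1 traces a straight line in the embedding space, which is what makes the intensity control linear; values slightly outside $[0, 1]$ would extrapolate the expression, though the caption only describes interpolation.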