CFCPalsy: Facial Image Synthesis with Cross-Fusion Cycle Diffusion Model for Facial Paralysis Individuals

Weixiang Gao; Yating Zhang; Yifan Xia

TL;DR

Facial paralysis diagnosis is hampered by subjective clinical assessment and scarce data. To address this, the authors propose CFCPalsy, a diffusion-based generator that fuses identity, expression, and landmark features to synthesize realistic facial paralysis images. The model introduces a cross-fusion module built on a QKV-based cross-attention mechanism, together with a cycle diffusion training strategy whose two losses, $L(\theta)$ and $L(\theta')$, are combined into the overall objective. Within the diffusion framework, the forward process is defined as $q(X_t|X_{t-1})=\mathcal{N}(X_t;\sqrt{1-\beta_t}X_{t-1},\beta_t I)$ and the denoising objective as $L(\theta)=\mathbb{E}_{t,X_t,\epsilon}[\|\epsilon-\epsilon_\theta(X_t,t)\|^2]$, a setup intended to enable high-fidelity synthesis even with limited data. Experiments on AFLFP and MEEI show that CFCPalsy outperforms baselines on aFID, PSNR, and SSIM, and the synthesized images are offered as a resource for training and evaluating clinical facial palsy analysis pipelines.
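
The forward process and denoising objective quoted above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The linear $\beta_t$ schedule, the helper names, and the placeholder `zero_predictor` (standing in for the trained network $\epsilon_\theta$) are all assumptions made for illustration. The closed-form sample $X_t=\sqrt{\bar\alpha_t}X_0+\sqrt{1-\bar\alpha_t}\,\epsilon$, with $\bar\alpha_t=\prod_{s\le t}(1-\beta_s)$, follows from applying the one-step Gaussian repeatedly.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the source does not specify one."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_step(x_prev, beta_t, rng):
    """Sample X_t ~ N(sqrt(1 - beta_t) * X_{t-1}, beta_t * I)."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def alpha_bar(betas, t):
    """Product of (1 - beta_s) over the first t steps (t = 0 gives 1)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noisy_sample(x0, betas, t, eps):
    """Closed form: X_t = sqrt(abar_t) * X_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * e for v, e in zip(x0, eps)]


def denoising_loss(eps, eps_pred):
    """Squared L2 error ||eps - eps_pred||^2 for a single sample."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


def zero_predictor(x_t, t):
    """Placeholder for eps_theta(X_t, t); a real model would be a trained network."""
    return [0.0] * len(x_t)


rng = random.Random(0)
betas = linear_beta_schedule(1000)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # toy stand-in for an image
t = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = noisy_sample(x0, betas, t, eps)
loss = denoising_loss(eps, zero_predictor(x_t, t))
```

Averaging `denoising_loss` over random draws of `t`, `x0`, and `eps` gives a Monte Carlo estimate of the expectation in $L(\theta)$. The conditioning on identity, expression, and landmark features is omitted here because the quoted objective does not show it.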

Abstract

Currently, the diagnosis of facial paralysis remains a challenging task, often relying heavily on the subjective judgment and experience of clinicians, which can introduce variability and uncertainty in the assessment process. One promising application in real-life situations is the automatic estimation of facial paralysis. However, the scarcity of facial paralysis datasets limits the development of robust machine learning models for automated diagnosis and therapeutic interventions. To this end, this study aims to synthesize a high-quality facial paralysis dataset to address this gap, enabling more accurate and efficient algorithm training. Specifically, a novel Cross-Fusion Cycle Palsy Expression Generative Model (CFCPalsy) based on the diffusion model is proposed to combine different features of facial information and enhance the visual details of facial appearance and texture in facial regions, thus creating synthetic facial images that accurately represent various degrees and types of facial paralysis. We have qualitatively and quantitatively evaluated the proposed method on the commonly used public clinical datasets of facial paralysis to demonstrate its effectiveness. Experimental results indicate that the proposed method surpasses state-of-the-art methods, generating more realistic facial images and maintaining identity consistency.

Paper Structure (24 sections, 11 equations, 6 figures, 1 table, 1 algorithm)

Introduction
Related Work
Face Synthesis
Diffusion Models
Method
Preliminary
Feature Extraction Module
ID Extraction Module
Expression Extraction Module
Landmark Extractor
Cross-Fusion Module
Cycle Diffusion Strategy
Experiments
Datasets
AFLFP
...and 9 more sections

Figures (6)

Figure 1: Facial palsy synthesis results. Each row contains images from the same individual, and each column shows images with the same facial palsy expression.
Figure 2: An overview of the forward process of CFCPalsy. a) The architecture of our feature extractors. We train the model using different images of the same patient, with one serving as the identity image $X_{ID}$ and the other as the facial palsy expression image $X_{0}$. The identity model extracts identity features from the identity image. The expression module (including $E_{Face}$ and $E_{ID}$) and $E_{LM}$ extract expression and landmark features from the facial palsy expression image, respectively. b) An illustration of the principle of the cross feature fusion strategy. We employ a method similar to cross-attention to facilitate information exchange among the three conditional features. After this interaction, the features are concatenated for further processing. c) A diagram of the noise predicting network. CFCPalsy utilizes a classic U-Net architecture combined with residual connections and cross-attention mechanisms to accurately predict the noise.
Figure 3: An illustration of the cycle training strategy of CFCPalsy. Each training sample undergoes two diffusion processes; the plus sign indicates the noise addition process.
Figure 4: A schematic diagram of the aFID calculation method. The FID value is calculated between the twice-synthesized data and the original data.
Figure 5: Visualized experimental results. CFCPalsy is the standard version, CFCPalsy$^1$ omits the facial landmarks, CFCPalsy$^2$ omits the cycle training strategy, and CFCPalsy$^3$ omits the cross-fusion module.
...and 1 more figures
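
The cross feature fusion described in the Figure 2 caption (cross-attention-like exchange among the three conditional features, followed by concatenation) might be sketched as below. This is a hedged illustration, not the paper's module. The function names are hypothetical, the learned Q/K/V projections of a real attention layer are replaced by identity maps, and the choice that each feature attends to the other two is an assumption about how the "information exchange" is wired.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query_tokens, context_tokens):
    """Scaled dot-product attention with Q = query_tokens and K = V = context_tokens.

    Assumption: identity projections are used in place of learned W_Q, W_K, W_V.
    """
    dim = len(query_tokens[0])
    attended = []
    for q in query_tokens:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in context_tokens])
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context_tokens)) for j in range(dim)]
        )
    return attended


def cross_fuse(f_id, f_exp, f_lm):
    """Let each feature attend to the other two, then concatenate along the token axis."""
    fused_id = cross_attend(f_id, f_exp + f_lm)
    fused_exp = cross_attend(f_exp, f_id + f_lm)
    fused_lm = cross_attend(f_lm, f_id + f_exp)
    return fused_id + fused_exp + fused_lm


# Toy features: one 2-dimensional token each for identity, expression, landmarks.
fused = cross_fuse([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]])
```

In this sketch the fused tokens keep the input dimensionality, so the concatenated sequence could be passed as the conditioning context of a cross-attention U-Net like the one the caption mentions. How the actual model shapes and projects these features is not stated in this excerpt.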
