Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition

Rongkang Dong, Cuixin Yang, Cong Zhang, Yushen Zuo, Kin-Man Lam

Abstract

Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, training EmoDC with the standard denoising objective does not penalize predictions conditioned on incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between the noise-prediction errors for correct and incorrect categories, thereby enhancing the model's discriminative capability. Nevertheless, a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model trained with the standard denoising diffusion objective on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in robustness against noise and blur.
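
For context, EmoDC follows the diffusion-classifier paradigm: at inference, the conditional diffusion model denoises noised versions of the input image once per candidate expression description, and the prediction is the category with the lowest expected noise-prediction error $\mathbb{E}_{t,\epsilon}||\epsilon_{\theta}(\bm{x}_{t}, t, c_{k})-\epsilon||_{2}^{2}$ (see Figure 2). The sketch below illustrates this procedure in PyTorch; the `noise_predictor` interface, the encoded `class_prompts`, and the schedule tensor are hypothetical placeholders rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, class_prompts, noise_predictor, alphas_cumprod,
                       num_trials=100):
    """Predict the class whose description best explains the image:
    argmin_k E_{t, eps} ||eps_theta(x_t, t, c_k) - eps||^2.

    x0:              clean facial image, shape (1, C, H, W)
    class_prompts:   list of N_c encoded categorical text conditions
    noise_predictor: eps_theta(x_t, t, c) -> predicted noise (hypothetical API)
    alphas_cumprod:  cumulative products of the noise schedule, shape (T,)
    """
    T = alphas_cumprod.shape[0]
    errors = torch.zeros(len(class_prompts))
    for _ in range(num_trials):
        # Share the same (t, eps) draw across all classes: only the
        # relative errors matter for the final argmin.
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward diffusion
        for k, c_k in enumerate(class_prompts):
            eps_hat = noise_predictor(x_t, t, c_k)
            errors[k] += ((eps_hat - eps) ** 2).mean().item()
    return int(errors.argmin())  # index of the predicted expression
```

Sharing the sampled timestep and noise across categories is a common variance-reduction choice in diffusion classifiers; the evaluation cost grows with the number of $(t, \epsilon)$ trials (e.g., the paper's 100-step evaluations) and the number of categories $N_{c}$.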

Figures (7)

  • Figure 1: An example comparing classifications of discriminative and diffusion classifiers on clean and noisy images.
  • Figure 2: Inference pipeline of the diffusion classifier (li2023your) for FER. $N_{c}$ denotes the total number of categories.
  • Figure 3: Illustration of: (a) The proposed AMDiT framework, where $d_{*}$ denotes the noise-prediction error $||\epsilon_{\theta}(\bm{x}_{t}, t, c_{*})-\epsilon||_{2}^{2}$ and $*$ represents $p$, $n$, or ${\rm nn}$. Specifically, $d_{p}$, $d_{n}$, and $d_{\rm nn}$ represent the noise-prediction errors of a positive image-text pair, a negative image-text pair, and a pair consisting of a facial image and a non-negative text prompt, respectively. These prediction errors are used to compute the $\mathcal{L}_{\rm AMDiT}$ training objective (Eq. (\ref{eq:amd}); a hedged sketch of one plausible form appears after this figure list). A negative prompt $c_{n}$ for an image is randomly sampled from a pool of negative descriptions (i.e., all categorical descriptions except the positive one). (b) Analysis of training with a fixed margin $m_{f}$. (c) Analysis of training with an adaptive margin $\alpha d_{\rm nn}$. A larger circle radius indicates higher noise-prediction difficulty and therefore a higher error (e.g., a "hard" sample), and vice versa.
  • Figure 4: Correlations of noise-prediction errors between the positive prompt $c_{p}$ and the null prompt $\emptyset$, non-class prompt $c_{\rm nc}$, and negative prompt $c_{n}$. For each sample, the final error for negative prompts is calculated by averaging the errors across all incorrect categorical descriptions. The red line represents a linear reference with a slope of 1. The "Base" model is EmoDC trained using vanilla fine-tuning without negative image-text pairs, with its training objective defined in Equation (\ref{eq:base}). Experiments were conducted on 300 samples from RAF-DB_B.
  • Figure 5: Confusion matrices of four EmoDC variants across three datasets in 100-step evaluations. The mean class accuracy (%) is shown above each matrix. For AMDiT, results for $c_{\rm nn}=c_{p}$ are presented.
  • ...and 2 more figures
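
The Base model in Figure 4 is fine-tuned without negative image-text pairs, presumably with the standard conditional denoising objective $\mathcal{L}_{\rm base}=\mathbb{E}_{\bm{x}_{0},\epsilon,t}\left[||\epsilon_{\theta}(\bm{x}_{t}, t, c_{p})-\epsilon||_{2}^{2}\right]$ (Eq. (\ref{eq:base})). Since Eq. (\ref{eq:amd}) itself is not reproduced in this excerpt, the following is a minimal sketch of one plausible hinge-style instantiation of the AMDiT objective implied by Figure 3: fit the positive pair while forcing the negative-pair error to exceed the positive-pair error by the adaptive margin $\alpha d_{\rm nn}$. The function name, signature, and exact combination of terms are assumptions, not the authors' loss.

```python
import torch
import torch.nn.functional as F

def amdit_loss(d_p: torch.Tensor, d_n: torch.Tensor, d_nn: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """One plausible adaptive-margin discrepancy loss (assumed formulation).

    d_p  : noise-prediction error of the positive image-text pair
    d_n  : error of a randomly sampled negative (mismatched) description
    d_nn : error of the non-negative prompt, used here as a per-sample
           difficulty estimate that sets the adaptive margin alpha * d_nn
    """
    margin = alpha * d_nn.detach()  # harder samples receive a larger margin
    # Minimize the positive error, and penalize the model whenever the
    # negative error fails to exceed the positive error by the margin.
    return (d_p + F.relu(margin - (d_n - d_p))).mean()
```

Compared with a fixed margin $m_{f}$, which demands the same separation from easy and hard samples alike (Figure 3(b)), scaling the margin by $d_{\rm nn}$ lets the required discrepancy grow with each image's intrinsic noise-prediction difficulty (Figure 3(c)).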