LRDif: Diffusion Models for Under-Display Camera Emotion Recognition

Zhifeng Wang; Kaihao Zhang; Ramesh Sankaranarayana

TL;DR

LRDif tackles facial expression recognition (FER) on images degraded by under-display cameras (UDC) by combining a two-stage training framework with diffusion-based label restoration. In the first stage, $FPEN_{S_1}$ builds a compact emotion prior representation (EPR) $Z$ that guides UDCformer; in the second stage, a diffusion model estimates $Z$ directly from degraded UDC images, enabling robust emotion prediction. The approach combines a Dynamic UDC transformer (UDCformer) with a Dynamic Image and Landmarks Network (DILnetwork) for multi-scale feature fusion, and optimizes a total loss $\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}$ that adds a KL-based EPR regularization term to the cross-entropy classification loss. Empirically, LRDif achieves state-of-the-art or competitive results on standard FER datasets (RAF-DB, FERPlus, KDEF) and their UDC variants (UDC-RAF-DB, UDC-FERPlus, UDC-KDEF), which suggests practical value for FER on devices with UDC hardware.
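To make the total loss concrete, here is a minimal NumPy sketch of $\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}$. The summary above only states that the loss adds a KL-based EPR term to cross-entropy, so the details are assumptions: the function names are hypothetical, the EPR vectors are turned into distributions with a softmax, and the KL direction (target EPR against estimated EPR) is chosen for illustration rather than taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy between class logits and integer emotion labels."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def kl_divergence(z_target, z_estimate):
    """Mean KL(p || q), where p and q are softmax-normalised EPR vectors.

    Assumption: normalising the EPRs with a softmax and using this KL
    direction are illustrative choices, not confirmed by the source.
    """
    p = softmax(z_target)
    q = softmax(z_estimate)
    per_sample = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(per_sample.mean())

def total_loss(logits, labels, z_target, z_estimate):
    """L_total = L_ce + L_kl, with equal weighting as in the formula above."""
    return cross_entropy(logits, labels) + kl_divergence(z_target, z_estimate)
```

With uniform logits over seven emotion classes and identical EPRs, the KL term vanishes and the total reduces to the cross-entropy of a uniform prediction, $\log 7$.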

Abstract

This study introduces LRDif, a novel diffusion-based framework designed specifically for facial expression recognition (FER) within the context of under-display cameras (UDC). To address the inherent challenges posed by UDC's image degradation, such as reduced sharpness and increased noise, LRDif employs a two-stage training strategy that integrates a condensed preliminary extraction network (FPEN) and an agile transformer network (UDCformer) to identify emotion labels from UDC images. By harnessing the robust distribution mapping capabilities of Diffusion Models (DMs) and the spatial dependency modeling strength of transformers, LRDif overcomes the obstacles of noise and distortion inherent in UDC environments. Comprehensive experiments on standard FER datasets, including RAF-DB, KDEF, and FERPlus, show that LRDif achieves state-of-the-art performance, underscoring its potential in advancing FER applications. This work not only addresses a significant gap in the literature by tackling the UDC challenge in FER but also sets a new benchmark for future research in the field.

Paper Structure (19 sections, 13 equations, 7 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Facial Expression Recognition
Diffusion Models
Methods
Pretrain DTnetwork
Dynamic Image and Landmarks Network (DILnetwork)
Diffusion Models for Label Restoration
Experiments
Datasets
Implementation Details
Comparison with SOTA FER Methods
Comparison with Typical FER-model
Comparison with the UDC FER-model
FLOPs and Param Comparison
...and 4 more sections

Figures (7)

Figure 1: Comparison of two camera images and their color histograms. (a) An image taken with an under-display camera (UDC): compared with a regular camera image, it is blurrier, noisier, and less faithful in color. (b) An image taken with a traditional external camera: the image is much clearer, and the girl's face is sharp, with well-defined features and more natural colors.
Figure 2: Overview of the proposed LRDif, which consists of UDCformer, FPEN, and a denoising network. LRDif has two training stages: (a) In the first stage, $FPEN_{S_1}$ takes the ground-truth label and the UDC image as input and outputs an EPR $Z$ that guides UDCformer in restoring labels. $FPEN_{S_1}$ is optimized jointly with LRDif$_{S_1}$ so that LRDif$_{S_1}$ can make full use of the extracted EPR $Z$. (b) In the second stage, the strong data estimation ability of the PDDM is used to estimate the EPR extracted by the pretrained $FPEN_{S_1}$. Notably, the ground-truth label is not fed to $FPEN_{S_2}$ or the denoising network. At inference, only the reverse process of the PDDM is used.
Figure 3: The overview of DTNet, which consists of DGNet and DMNet.
Figure 4: Feature distributions learned by SCN and LRDif trained on the RAF-DB dataset.
Figure 5: Feature distributions learned by SCN and LRDif trained on the RAF-DB dataset.
...and 2 more figures
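
The Figure 2 caption says that only the reverse diffusion process is used at inference to estimate the EPR $Z$. The sketch below illustrates that idea with a standard DDPM-style reverse loop in NumPy. It is not the paper's implementation: the linear noise schedule, the step count, the `reverse_sample` and `toy_denoiser` names, and the `condition` vector (standing in for features that $FPEN_{S_2}$ would extract from the UDC image) are all assumptions made for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (assumed schedule, not from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def reverse_sample(denoiser, condition, dim, num_steps=4, seed=0):
    """Run a DDPM-style reverse process to produce an EPR estimate.

    denoiser(z, t, condition) must return the predicted noise for step t.
    """
    rng = np.random.default_rng(seed)
    betas = linear_beta_schedule(num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    z = rng.standard_normal(dim)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = denoiser(z, t, condition)
        # Posterior mean of the standard DDPM reverse step.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        z = mean + np.sqrt(betas[t]) * noise
    return z

def toy_denoiser(z, t, condition):
    """Hypothetical stand-in for a trained denoising network."""
    return z - condition
```

In a real system the denoiser would be a trained network; the toy function here only shows the interface the loop expects.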
