LLDif: Diffusion Models for Low-light Emotion Recognition
Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana
TL;DR
LLDif tackles facial expression recognition (FER) in extremely low-light (LL) conditions by learning a compact Embedding Prior Distribution (EPD) that couples visual features with emotion labels. It uses a two-stage diffusion-based framework. In stage 1, a Label-aware CLIP (LA-CLIP) generates a joint EPD $Z$ that guides label restoration in a Low-Light transformer (LLformer). In stage 2, a diffusion model infers a refined EPD directly from LL images, which enables accurate predictions with few diffusion steps. The architecture comprises LA-CLIP, PNET, LLformer, and a diffusion-based label-restoration pathway with cross-window attention in DLNet and transformer-based fusion. Training uses the joint loss $\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}$, where $\mathcal{L}_{ce}$ is the classification cross-entropy and $\mathcal{L}_{kl}$ aligns the LA-CLIP and stage-2 EPDs. Experiments on LL-RAF-DB, LL-FERPlus, and LL-KDEF show accuracy competitive with or superior to state-of-the-art (SOTA) methods, demonstrating the practical potential of diffusion priors for robust LL FER in real-world applications.
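The joint objective can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`softmax`, `cross_entropy`, `kl_divergence`, `total_loss`) are hypothetical, and we assume $\mathcal{L}_{kl}$ is computed between softmax-normalised EPD vectors, with the LA-CLIP EPD as the target distribution. The paper's exact normalisation and any loss weighting may differ.

```python
import math
from typing import List, Sequence, Tuple


def softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """L_ce: negative log-probability of the ground-truth emotion class."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(
    logits: Sequence[float],
    label: int,
    z_clip: Sequence[float],
    z_diff: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (L_total, L_ce, L_kl) with L_total = L_ce + L_kl.

    z_clip: stage-1 EPD from LA-CLIP (treated as the target).
    z_diff: stage-2 EPD inferred by the diffusion model from the LL image.
    """
    l_ce = cross_entropy(logits, label)
    # Assumption: EPD vectors are softmax-normalised before the KL term.
    l_kl = kl_divergence(softmax(z_clip), softmax(z_diff))
    return l_ce + l_kl, l_ce, l_kl
```

With identical EPDs the KL term vanishes and the loss reduces to plain cross-entropy; a mismatch between the stage-1 and stage-2 EPDs adds a penalty on top of the classification term.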
Abstract
This paper introduces LLDif, a novel diffusion-based facial expression recognition (FER) framework tailored to extremely low-light (LL) environments. Images captured under such conditions often suffer from low brightness and significantly reduced contrast, and the resulting poor image quality can markedly reduce the accuracy of conventional emotion recognition methods. LLDif addresses these issues with a two-stage training process that combines a Label-aware CLIP (LA-CLIP), an embedding prior network (PNET), and a transformer-based network (LLformer) designed to handle the noise in low-light images. In the first stage, LA-CLIP generates a joint embedding prior distribution (EPD) that guides the LLformer in label recovery. In the second stage, a diffusion model (DM) refines EPD inference from LL images, utilising the compactness of the EPD for precise predictions. Experimental evaluations on several LL-FER datasets show that LLDif achieves competitive performance, underscoring its potential to enhance FER applications in challenging lighting conditions.
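The stage-2 idea, recovering a compact EPD from noise in only a few reverse steps, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps in a deterministic DDIM-style sampler (eta = 0) with a linear noise schedule, and the names `alpha_bars`, `ddim_sample`, and `make_oracle_denoiser` are hypothetical. In LLDif the denoiser would be a trained network conditioned on LL-image features; here an "oracle" that already knows the target EPD stands in for it, so the sampler should return that EPD after four steps.

```python
import math
import random
from typing import Callable, List

Vector = List[float]
# Hypothetical denoiser signature: (noisy EPD z_t, step index t, LL-image condition) -> predicted noise.
Denoiser = Callable[[Vector, int, Vector], Vector]


def alpha_bars(num_steps: int, beta_start: float = 0.1, beta_end: float = 0.99) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule (illustrative values)."""
    out: List[float] = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def ddim_sample(denoiser: Denoiser, cond: Vector, dim: int, abars: List[float], seed: int = 0) -> Vector:
    """Deterministic DDIM (eta = 0) reverse process over a compact EPD vector."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise z_T
    for t in reversed(range(len(abars))):
        a_t = abars[t]
        eps = denoiser(z, t, cond)
        # Estimate the clean EPD from the current noisy sample.
        z0_hat = [(zi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for zi, ei in zip(z, eps)]
        if t == 0:
            return z0_hat
        a_prev = abars[t - 1]
        # Move the estimate to the previous (less noisy) level along the same noise direction.
        z = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * ei for x0, ei in zip(z0_hat, eps)]
    return z


def make_oracle_denoiser(z0_true: Vector, abars: List[float]) -> Denoiser:
    """Stand-in for a trained network: returns the exact noise given the known clean EPD."""
    def denoise(z_t: Vector, t: int, cond: Vector) -> Vector:
        a_t = abars[t]
        return [(zi - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t) for zi, x0 in zip(z_t, z0_true)]
    return denoise


if __name__ == "__main__":
    schedule = alpha_bars(4)
    target_epd = [0.5, -1.0, 2.0, 0.0]  # hypothetical EPD values
    ll_features = [0.1, 0.2]  # placeholder condition, ignored by the oracle
    recovered = ddim_sample(make_oracle_denoiser(target_epd, schedule), ll_features, 4, schedule)
    print([round(v, 6) for v in recovered])
```

Operating on a short prior vector rather than on pixels keeps each reverse step inexpensive, which is in line with the abstract's point about exploiting the compactness of the EPD.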
