LLDif: Diffusion Models for Low-light Emotion Recognition

Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana

TL;DR

LLDif tackles facial expression recognition (FER) in extremely low-light (LL) conditions by learning a compact Embedding Prior Distribution (EPD) that couples visual features with emotion labels. It uses a two-stage diffusion-based framework: stage 1 uses Label-aware CLIP (LA-CLIP) to generate a joint EPD $Z$ that guides label restoration in a low-light transformer (LLformer), and stage 2 trains a diffusion model to infer a refined EPD directly from LL images, enabling accurate predictions with few diffusion steps. The architecture combines LA-CLIP, PNET, LLformer, and a diffusion-based label restoration pathway with cross-window attention in DLNet and transformer-based fusion. It is trained with a joint loss $\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}$, where $\mathcal{L}_{kl}$ aligns the LA-CLIP and stage-2 EPDs. Experimental results on LL-RAF-DB, LL-FERPlus, and LL-KDEF show competitive or superior accuracy compared to SOTA methods, demonstrating the practical potential of diffusion priors for robust LL FER in real-world applications.
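
The joint loss above can be illustrated with a minimal sketch. It assumes that $\mathcal{L}_{ce}$ is a standard cross-entropy over class logits and that $\mathcal{L}_{kl}$ is a KL divergence between softmax-normalised EPD vectors from stage 1 (LA-CLIP) and stage 2 (diffusion). Neither detail is spelled out in the summary, and all function and argument names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Assumed form of L_ce: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_loss(logits, label, z_stage1, z_stage2):
    """L_total = L_ce + L_kl.

    z_stage1: EPD vector from LA-CLIP (hypothetical representation).
    z_stage2: EPD vector inferred by the diffusion model (hypothetical).
    Normalising both with softmax before the KL term is an assumption.
    """
    l_ce = cross_entropy(logits, label)
    l_kl = kl_divergence(softmax(z_stage1), softmax(z_stage2))
    return l_ce + l_kl

# Usage: seven emotion classes and a 4-dimensional EPD (sizes are illustrative).
loss = total_loss(
    logits=[0.2, 1.5, -0.3, 0.0, 0.1, -1.0, 0.4],
    label=1,
    z_stage1=[0.5, -1.0, 2.0, 0.1],
    z_stage2=[0.4, -0.8, 1.7, 0.2],
)
```

When the two EPD vectors coincide, the KL term vanishes and the total reduces to the cross-entropy alone, which is the behaviour the alignment term is meant to encourage.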

Abstract

This paper introduces LLDif, a novel diffusion-based facial expression recognition (FER) framework tailored for extremely low-light (LL) environments. Images captured under such conditions often suffer from low brightness and significantly reduced contrast, and the resulting poor image quality can significantly reduce the accuracy of conventional emotion recognition methods. LLDif addresses these issues with a novel two-stage training process that combines a Label-aware CLIP (LA-CLIP), an embedding prior network (PNET), and a transformer-based network (LLformer) adept at handling the noise of low-light images. In the first stage, LA-CLIP generates a joint embedding prior distribution (EPD) to guide the LLformer in label recovery. In the second stage, the diffusion model (DM) refines the EPD inference, utilising the compactness of the EPD for precise predictions. Experimental evaluations on various LL-FER datasets show that LLDif achieves competitive performance, underscoring its potential to enhance FER applications in challenging lighting conditions.
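
As a rough illustration of diffusion over a compact embedding vector, the sketch below implements the standard DDPM closed-form forward noising step and its inversion given a noise estimate. The abstract does not say that LLDif uses exactly this parameterisation. The four-step schedule, the vector size, and all names here are assumptions chosen for illustration.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one per diffusion step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_noise(z0, abar_t, rng):
    """Sample z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(abar_t) * z + math.sqrt(1.0 - abar_t) * e
          for z, e in zip(z0, eps)]
    return zt, eps

def estimate_z0(zt, eps_hat, abar_t):
    """Invert the forward step given a noise estimate eps_hat."""
    return [(z - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for z, e in zip(zt, eps_hat)]

# Usage: a short, hypothetical 4-step schedule on a small EPD-like vector.
rng = random.Random(0)
betas = [0.1, 0.2, 0.3, 0.4]
abars = alpha_bars(betas)
z0 = [0.5, -1.0, 2.0, 0.1]
zt, eps = forward_noise(z0, abars[-1], rng)
# In training, a denoising network would predict eps from zt and the
# low-light image features; here the true noise stands in for that prediction.
z0_hat = estimate_z0(zt, eps, abars[-1])
```

With a perfect noise estimate the original vector is recovered exactly, so the quality of the recovered embedding depends entirely on how well the denoiser predicts the noise.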

Paper Structure (12 sections, 13 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Top: the low-light image (LL) shows a child's face that is shadowed, with details obscured, making it challenging to discern fine facial expressions. Bottom: in the normal-light image (CI), the child's face is clearly visible with good detail, which is essential for recognizing emotions.
  • Figure 2: The proposed LLDif framework, comprising Label-aware CLIP (LA-CLIP), LLformer, PNET, and a denoising network. LLDif employs a two-stage training method: (1) Initially, we apply LA-CLIP to process the low-light image alongside its image caption and label, producing a Joint Embedding Prior Distribution (EPD) Z. This EPD is then used to instruct the LLformer in label restoration. (2) During the second stage, the diffusion model (DM) undergoes training to directly deduce the precise Embedding Prior Distribution (EPD) from images captured in low-light conditions.
  • Figure 3: The overview of DTNet, which consists of DGNet and DMNet.
  • Figure 4: Emotion distribution for samples in the LL-RAF-DB and RAF-DB datasets.
  • Figure 5: t-SNE visualisation of the predicted features for our method and SCN.
  • ...and 3 more figures