AffectSRNet : Facial Emotion-Aware Super-Resolution Network

Syed Sameen Ahmad Rizvi, Soham Kumar, Aryan Seth, Pratik Narang

TL;DR

This work tackles the challenge of degraded facial emotion recognition in low-resolution imagery by introducing AffectSRNet, a facial emotion-aware super-resolution framework that preserves expressions during upscaling. It combines an RRDB-based SR backbone with a Graph Convolutional Network over 478 facial landmarks and a Multimodal Split Attention Fusion module to inject structural landmark information into the reconstruction process. A dedicated Emotion Consistency Metric (ECM), defined as $ECM = \alpha L_H + \log(L_{\text{conf}})$ with $\alpha = 0.5$, alongside a multi-term loss that blends pixel, perceptual (style), and graph-based constraints, guides the model toward both high visual quality and expression fidelity. Experimental results on CelebA, FFHQ, and Helen show competitive image quality metrics and superior emotion fidelity (ECM) compared with state-of-the-art FSR methods, indicating strong potential for practical FER deployment in low-resolution environments.
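As a minimal sketch of the ECM formula above, the snippet below combines the two terms with the stated weight $\alpha = 0.5$. This summary does not define $L_H$ or $L_{\text{conf}}$, so both are treated here as precomputed positive scalars; the function name, the natural-log base, and the input check are assumptions made for illustration, not the authors' implementation.

```python
import math


def emotion_consistency_metric(l_h: float, l_conf: float, alpha: float = 0.5) -> float:
    """Compute ECM = alpha * L_H + log(L_conf).

    l_h and l_conf are assumed to be precomputed scalars; how they are
    derived from the images is not specified in this summary.
    The natural logarithm is an assumption.
    """
    if l_conf <= 0:
        # log is undefined for non-positive inputs
        raise ValueError("l_conf must be positive")
    return alpha * l_h + math.log(l_conf)


# With L_H = 2.0 and L_conf = 1.0, the log term vanishes and ECM = 0.5 * 2.0.
print(emotion_consistency_metric(2.0, 1.0))  # prints 1.0
```

Because the logarithm grows slowly, the $L_{\text{conf}}$ term contributes less than proportionally as it increases, while $L_H$ enters linearly and is scaled by $\alpha$.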

Abstract

Facial expression recognition (FER) systems in low-resolution settings face significant challenges in accurately identifying expressions due to the loss of fine-grained facial details. This limitation is especially problematic for applications like surveillance and mobile communications, where low image resolution is common and can compromise recognition accuracy. Traditional single-image face super-resolution (FSR) techniques, however, often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. Given the inherently ill-posed nature of single-image super-resolution, a targeted approach is required to balance image quality enhancement with emotion retention. In this paper, we propose AffectSRNet, a novel emotion-aware super-resolution framework that reconstructs high-quality facial images from low-resolution inputs while maintaining the intensity and fidelity of facial expressions. Our method effectively bridges the gap between image resolution and expression accuracy by employing an expression-preserving loss function, specifically tailored for FER applications. Additionally, we introduce a new metric to assess emotion preservation in super-resolved images, providing a more nuanced evaluation of FER system performance in low-resolution scenarios. Experimental results on standard datasets, including CelebA, FFHQ, and Helen, demonstrate that AffectSRNet outperforms existing FSR approaches in both visual quality and emotion fidelity, highlighting its potential for integration into practical FER applications. This work not only improves image clarity but also ensures that emotion-driven applications retain their core functionality in suboptimal resolution environments, paving the way for broader adoption in FER systems.

Paper Structure

This paper contains 33 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of AffectSRNet with other super-resolution methods (SRCNN, FSRNet, DIC) on low-resolution facial images. AffectSRNet achieves superior image clarity and emotion fidelity, preserving fine facial details and expression accuracy.
  • Figure 2: The network architecture of AffectSRNet. The super-resolution backbone consists of the RRDB and upsampling blocks from ESRGAN. Facial landmarks extracted with MediaPipe are passed through a GCN block to obtain graph embeddings. These are integrated into the super-resolution backbone, with the MSAF block performing cross-modal fusion.
  • Figure 3: The edges are defined as illustrated to preserve the spatial dependence of the features important for facial expression.
  • Figure 4: Visual comparison of leading methods on Helen, FFHQ, and CelebA. Results for an upsampling factor of 4 are shown for FFHQ and Helen; results for an upsampling factor of 8 are shown for CelebA and FFHQ. The methods compared are Bicubic Interpolation, SRCNN, EDSR, FSRNet, DIC, and SPARNet. Zoomed-in crops of the left eye and mouth are shown to help judge super-resolution quality. To further compare perceptual quality across methods, the image can be zoomed up to 10×.
  • Figure 5: Subjective visual performance for 8× SR on real-world surveillance scenarios from the SCface dataset. Visual comparisons are shown on two sample images from the dataset.
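
The landmark-to-embedding step described in the Figure 2 caption can be illustrated with a single graph convolution over the 478 landmarks. This is only a sketch under stated assumptions: it uses the standard symmetric-normalized propagation rule $\mathrm{ReLU}(D^{-1/2}(A+I)D^{-1/2}XW)$, which the summary does not confirm is the variant used in AffectSRNet; the chain-shaped edge list is a placeholder for the expression-specific edges of Figure 3, and the random landmark coordinates, the weight matrix, and the 16-dimensional output are all hypothetical.

```python
import math
import random

NUM_LANDMARKS = 478  # landmark count stated in the text


def build_neighbors(edges, num_nodes):
    """Return undirected neighbor sets; each node includes itself (self-loop)."""
    nbrs = [{i} for i in range(num_nodes)]
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs


def gcn_layer(x, nbrs, w):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    x: per-node feature rows, nbrs: neighbor sets with self-loops,
    w: weight matrix as a list of rows (in_dim x out_dim).
    """
    deg = [len(n) for n in nbrs]  # degree including the self-loop
    in_dim, out_dim = len(w), len(w[0])
    out = []
    for i, n in enumerate(nbrs):
        # Aggregate symmetrically normalized neighbor features.
        agg = [0.0] * in_dim
        for j in n:
            c = 1.0 / math.sqrt(deg[i] * deg[j])
            for k in range(in_dim):
                agg[k] += c * x[j][k]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[k] * w[k][o] for k in range(in_dim)))
                    for o in range(out_dim)])
    return out


random.seed(0)
# Hypothetical inputs: random (x, y, z) coordinates stand in for real landmarks.
landmarks = [[random.random() for _ in range(3)] for _ in range(NUM_LANDMARKS)]
# Placeholder edges (a simple chain), NOT the edge definition of Figure 3.
edges = [(i, i + 1) for i in range(NUM_LANDMARKS - 1)]
weights = [[random.gauss(0.0, 0.1) for _ in range(16)] for _ in range(3)]

embeddings = gcn_layer(landmarks, build_neighbors(edges, NUM_LANDMARKS), weights)
print(len(embeddings), len(embeddings[0]))  # prints 478 16
```

Each output row is a per-landmark embedding that mixes a landmark's own coordinates with those of its graph neighbors, which is how the edge definition determines which spatial relationships the embeddings can encode.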