Global Multiple Extraction Network for Low-Resolution Facial Expression Recognition

Jingyi Shi

TL;DR

This work targets the challenge of recognizing facial expressions in low-resolution images by introducing GME-Net, a dual-branch architecture that combines a hybrid attention-based local feature extractor with a multi-scale global feature extractor, guided by attention-similarity knowledge distillation from a high-resolution teacher. The local branch employs Mixed-Attention Blocks with a Depthwise Block Attention Mechanism to capture fine-grained details, while the global branch uses Mixed-Channel Feature Extraction Blocks with a quasi-symmetric design to robustly model global cues. A distillation loss transfers relevant attention information from the HR teacher to the LR student, promoting consistent feature focus across resolutions. Experiments on downsampled benchmarks (RAF-DB, ExpW, FER2013, FERPlus) show GME-Net achieving superior or competitive accuracy with favorable efficiency, indicating improved robustness for LR-FER in practical scenarios.

Abstract

Facial expression recognition, a vital computer vision task, is attracting significant attention and extensive research. Although facial expression recognition algorithms perform impressively on high-resolution images, their effectiveness tends to degrade on low-resolution images. We attribute this to two causes: 1) low-resolution images lack detail information; 2) current methods perform weak global modeling, which makes it difficult to extract discriminative features. To alleviate these issues, we propose a novel global multiple extraction network (GME-Net) for low-resolution facial expression recognition, which incorporates 1) a hybrid attention-based local feature extraction module with attention similarity knowledge distillation to learn image details from a high-resolution network; 2) a multi-scale global feature extraction module with a quasi-symmetric structure to mitigate the influence of local image noise and facilitate capturing global image features. As a result, GME-Net is capable of extracting expression-related discriminative features. Extensive experiments on several widely used datasets demonstrate that the proposed GME-Net recognizes low-resolution facial expressions better and achieves performance superior to existing solutions.
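
The attention similarity knowledge distillation mentioned above can be illustrated with a minimal sketch. It assumes the loss is the mean cosine distance between paired teacher and student attention maps, one pair per block; this summary does not state the exact formula, so the function names and the choice of cosine distance are illustrative assumptions, not the paper's definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened attention maps (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero map as having no similarity
    return dot / (norm_a * norm_b)


def attention_similarity_loss(teacher_maps, student_maps):
    """Mean cosine distance over paired attention maps.

    Assumption: each entry is a flattened attention map taken from one block
    of the high-resolution teacher or the low-resolution student.
    """
    if len(teacher_maps) != len(student_maps):
        raise ValueError("teacher and student must provide the same number of maps")
    distances = [
        1.0 - cosine_similarity(t, s) for t, s in zip(teacher_maps, student_maps)
    ]
    return sum(distances) / len(distances)


# Hypothetical attention maps: identical maps give a loss near 0,
# orthogonal maps give a loss of 1.
same = attention_similarity_loss([[1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0], [0.5, 0.5]])
orthogonal = attention_similarity_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Under this assumption, minimizing the loss pushes the student's attention toward the same regions the teacher attends to, regardless of the absolute scale of the maps.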

Paper Structure

This paper contains 15 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Performance comparison between our method and other FER methods in terms of accuracy, computational complexity (GFLOPs), and model parameters on the low-resolution RAF-DB dataset. In both graphs, our method achieves the highest accuracy while maintaining a reasonable balance of model size and computational cost, showing the efficiency and effectiveness of GME-Net for low-resolution facial expression recognition.
  • Figure 2: Examples from our low-resolution dataset, generated from the public RAF-DB dataset by bicubic interpolation. The remaining datasets are produced in the same way.
  • Figure 3: The overall framework of GME-Net, where MAB stands for mixed-attention block and MCB stands for mixed-channel feature extraction block. An attention map is extracted from each MAB to calculate the distillation loss.
  • Figure 4: Sub-figure (a) depicts the Mixed-Attention Block (MAB), while sub-figures (b) and (c) illustrate the Depthwise Block Attention Mechanism (DBAM): sub-figure (b) shows the Depthwise-Channel Attention Module (DCAM), and sub-figure (c) shows the Depthwise-Spatial Attention Module (DSAM).
  • Figure 5: The structure of the Mixed-Channel Feature Extraction Block (MCB).
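
The bicubic interpolation used to build the low-resolution data (Figure 2) can be sketched in plain Python as follows. This is a minimal illustration, assuming a grayscale image stored as a list of rows and the common Keys cubic kernel with a = -0.5. The function names and the image sizes are hypothetical, no anti-aliasing prefilter is applied, and the authors' actual preprocessing pipeline may differ.

```python
import math


def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is a common default (assumed here)."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def resize_1d(values, out_len):
    """Resample a 1-D sequence to out_len samples with cubic interpolation."""
    n = len(values)
    scale = n / out_len
    out = []
    for i in range(out_len):
        # Map the centre of output sample i back into input coordinates.
        center = (i + 0.5) * scale - 0.5
        base = math.floor(center)
        acc = 0.0
        weight_sum = 0.0
        for k in range(base - 1, base + 3):  # four nearest input taps
            w = cubic_kernel(center - k)
            idx = min(max(k, 0), n - 1)  # clamp indices at the borders
            acc += w * values[idx]
            weight_sum += w
        out.append(acc / weight_sum)
    return out


def bicubic_resize(image, out_h, out_w):
    """Separable bicubic resize: rows first, then columns."""
    rows = [resize_1d(list(row), out_w) for row in image]
    cols = [resize_1d(list(col), out_h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# Hypothetical example: shrink a constant 8x8 image to 4x4.
high_res = [[5.0] * 8 for _ in range(8)]
low_res = bicubic_resize(high_res, 4, 4)
```

Because the kernel weights sum to one, a constant image stays constant after resizing, which makes a convenient sanity check before applying the function to real face crops.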