PAtt-Lite: Lightweight Patch and Attention MobileNet for Challenging Facial Expression Recognition

Jia Le Ngwe, Kian Ming Lim, Chin Poo Lee, Thian Song Ong

TL;DR

The paper tackles facial expression recognition under challenging real-world conditions by proposing PAtt-Lite, a lightweight patch and attention network built on a truncated MobileNetV1 backbone. It introduces a lightweight patch extraction block to force local feature learning and a dot-product self-attention classifier to better exploit patched representations, all while keeping the model small (1.10M parameters). Evaluations on CK+, RAF-DB, FER2013, and FERPlus, including challenging subsets, show state-of-the-art performance with strong robustness to occlusion and pose variations. Ablation studies confirm the patch extraction block and the attention classifier as the key contributors to performance gains, with patch extraction outperforming patch attention in this setting. The work demonstrates that efficient, edge-friendly FER is feasible without sacrificing accuracy, enabling practical deployment on mobile and embedded devices.
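As a rough illustration of the dot-product self-attention mentioned above, the following minimal sketch computes unscaled dot-product attention in which queries, keys, and values are all the same set of vectors. The function names, the toy two-vector input, and the absence of any learned projections are assumptions made for illustration; the paper's classifier may differ in these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_self_attention(tokens):
    """Unscaled dot-product self-attention (illustrative, no learned weights).

    tokens: list of equal-length float vectors, used as queries, keys, and values.
    Returns one attended vector per input vector.
    """
    attended = []
    for query in tokens:
        # Similarity of this query to every key is a plain dot product.
        scores = [sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        dim = len(query)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


# Toy input (hypothetical): two orthogonal feature vectors.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = dot_product_self_attention(tokens)
# Each vector attends most strongly to itself, so its own component dominates.
print(attended)
```

Because the attention weights are a softmax, every output is a convex combination of the inputs, so the layer can re-weight features without changing their scale.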

Abstract

Facial Expression Recognition (FER) is a machine learning problem that deals with recognizing human facial expressions. While existing work has achieved performance improvements in recent years, FER in the wild and under challenging conditions remains difficult. In this paper, a lightweight patch and attention network based on MobileNetV1, referred to as PAtt-Lite, is proposed to improve FER performance under challenging conditions. A truncated ImageNet-pre-trained MobileNetV1 is used as the backbone feature extractor. The truncated layers are replaced by a patch extraction block that extracts significant local facial features to enhance the representation from MobileNetV1, especially under challenging conditions. An attention classifier is also proposed to improve the learning of these patched feature maps from the extremely lightweight feature extractor. Experimental results on public benchmark databases demonstrate the effectiveness of the proposed method: PAtt-Lite achieved state-of-the-art results on CK+, RAF-DB, FER2013, FERPlus, and the challenging-condition subsets of RAF-DB and FERPlus.
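To give a sense of why a MobileNetV1-style design stays lightweight, the sketch below compares the weight count of a standard convolution with that of a depthwise separable convolution, the building block MobileNetV1 is known for. The layer sizes (a 3×3 kernel with 256 input and 256 output channels) are hypothetical values chosen for illustration, biases are ignored, and none of these numbers are taken from the paper.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 input channels, 256 output channels.
std = standard_conv_params(3, 256, 256)
sep = separable_conv_params(3, 256, 256)
print(std, sep, round(std / sep, 2))
```

For this hypothetical layer the separable form needs 67,840 weights against 589,824 for the standard form, a reduction of roughly 8.7 times.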

Paper Structure (25 sections, 6 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Architecture of the proposed PAtt-Lite. The image sample first passes through the truncated MobileNetV1 for feature extraction; the resulting feature maps are padded and fed to the proposed patch extraction block. The $2 \times 2 \times 256$ output feature maps of the patch extraction block are then global average pooled before being passed to the attention classifier.
  • Figure 2: Confusion matrices of patch extraction on challenging subsets of RAF-DB.
  • Figure 3: Confusion matrices of patch attention on challenging subsets of RAF-DB.
  • Figure 4: Confusion matrices of patch extraction on challenging subsets of FERPlus.
  • Figure 5: Confusion matrices of patch attention on challenging subsets of FERPlus.
  • ...and 5 more figures
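The shape flow described in the Figure 1 caption can be traced with a small sketch. Only the padding step, the $2 \times 2 \times 256$ patch-extraction output, and the global average pooling come from the caption; the 14×14 backbone output, the one-pixel padding per side, and the two non-overlapping strided convolutions (kernel 4 with stride 4, then kernel 2 with stride 2) are assumptions chosen so that the sizes work out, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def global_average_pool(fmap):
    """Average an H x W x C nested list over H and W, giving C channel means."""
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        sum(fmap[i][j][c] for i in range(height) for j in range(width)) / (height * width)
        for c in range(channels)
    ]


# Assumed sizes (not stated in the caption).
backbone_side = 14                  # assumed backbone feature-map side
padded_side = backbone_side + 2     # assumed one pixel of padding per side
side = conv_output_size(padded_side, kernel=4, stride=4)  # assumed first patch stage
side = conv_output_size(side, kernel=2, stride=2)         # assumed second patch stage

# Dummy 2 x 2 x 256 feature map standing in for the patch extraction output.
channels = 256
fmap = [[[float(i + j)] * channels for j in range(side)] for i in range(side)]
pooled = global_average_pool(fmap)
print(side, len(pooled))
```

With non-overlapping strides, each of the four output positions summarizes a distinct region of the padded feature map, which is one way to read the "patch" in patch extraction; pooling then reduces the four positions to a single 256-dimensional vector for the classifier.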