FER-YOLO-Mamba: Facial Expression Detection and Classification Based on Selective State Space

Hui Ma, Sen Lei, Turgay Celik, Heng-Chao Li

TL;DR

FER-YOLO-Mamba integrates a Vision Mamba (SSM-based) backbone with a YOLO detector to address two problems in facial expression recognition (FER): modeling long-range dependencies and limiting computational overhead. The FER-YOLO-VSS module combines an FRM branch with ABMLP-based attention and an OSS branch with OSSM, fusing local and global features across multiple scan directions for robust FER in cluttered real-world scenes. Evaluations on RAF-DB and SFEW show $mAP$ and per-class AP that are competitive with the state of the art, with strong performance on Happy and Surprise and weaker performance on Fear and Neutral. Ablations confirm the contribution of the FRM, OSS, and OSSM components. The results suggest that coupling linear-complexity SSM-based global modeling with YOLO-style detection can yield accurate, efficient FER for practical applications without a separate preprocessing stage.

Abstract

Facial Expression Recognition (FER) plays a pivotal role in understanding human emotional cues. However, traditional FER methods based on visual information rely on preprocessing, feature extraction, and multi-stage classification procedures, which increase computational complexity and demand substantial computing resources. Convolutional Neural Network (CNN)-based FER schemes frequently fail to capture the deep, long-distance dependencies embedded in facial expression images, while Transformers carry an inherent quadratic computational complexity. To address both problems, this paper presents the FER-YOLO-Mamba model, which integrates the principles of Mamba and YOLO so that facial expressions are recognized and localized in a single, efficient pipeline. Within the FER-YOLO-Mamba model, we further devise a FER-YOLO-VSS dual-branch module, which combines the strength of convolutional layers in local feature extraction with the capability of State Space Models (SSMs) to reveal long-distance dependencies. To the best of our knowledge, this is the first Vision Mamba model designed for facial expression detection and classification. To evaluate the proposed FER-YOLO-Mamba model, we conducted experiments on two benchmark datasets, RAF-DB and SFEW. The experimental results indicate that the FER-YOLO-Mamba model achieved better results than the compared models. The code is available from https://github.com/SwjtuMa/FER-YOLO-Mamba.
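
The contrast between quadratic attention and a linear-time state space recurrence can be illustrated with a small sketch. This is not the authors' implementation: the function names (`ssm_scan`, `scan_orders`, `multi_direction_ssm`), the fixed scalar parameters `a`, `b`, `c`, the choice of four scan directions, and the averaging used to merge them are all assumptions made for illustration. A real selective SSM such as Mamba makes its parameters input-dependent and uses a vector-valued state.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy discrete state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    A single pass over the sequence, so the cost grows linearly with its length.
    The scalars a, b, c are fixed here (an assumption made for this sketch).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def scan_orders(height, width):
    """Return four traversal orders over an H x W grid as lists of flat indices.

    The four directions (row-major, column-major, and their reverses) are an
    illustrative choice, not the scan pattern specified by the paper.
    """
    row_major = [r * width + col for r in range(height) for col in range(width)]
    col_major = [r * width + col for col in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def multi_direction_ssm(grid, a=0.9):
    """Run ssm_scan along each traversal order and average the results per cell."""
    height, width = len(grid), len(grid[0])
    flat = [value for row in grid for value in row]
    out = [0.0] * len(flat)
    orders = scan_orders(height, width)
    for order in orders:
        ys = ssm_scan([flat[i] for i in order], a=a)
        # Scatter each output back to the grid cell it was computed for.
        for pos, i in enumerate(order):
            out[i] += ys[pos] / len(orders)
    return [out[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    # A single impulse in the top-left corner reaches every cell of the grid.
    print(multi_direction_ssm([[1.0, 0.0], [0.0, 0.0]], a=0.5))
```

Each direction costs one pass over the H x W cells, so the total work is proportional to the number of cells, whereas full self-attention compares every cell with every other. Scanning in several directions lets every cell receive information from the whole grid even though each individual scan is causal.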

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 4 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The overall architecture of the FER-YOLO-Mamba.
  • Figure 2: FER-YOLO-VSS2 module.
  • Figure 3: Comparison of different network models in terms of Params (M) and FLOPs (G).
  • Figure 4: Test sample detection results and corresponding heatmaps on RAF-DB.
  • Figure 5: Test sample detection results and corresponding heatmaps on SFEW.