CS3D: An Efficient Facial Expression Recognition via Event Vision

Zhe Wang, Qijin Song, Yucen Peng, Weibang Bai

TL;DR

The paper tackles energy-efficient facial expression recognition using event cameras for edge robots. It introduces CS3D, a compact spatial-temporal network with factorized 3D convolutions, soft spiking neurons, and a spatial-temporal attention module, validated on event-converted FER datasets. Results show that CS3D achieves higher accuracy than RNN, Transformer, and C3D baselines while consuming only about 22% of C3D's energy on a Titan X, which suggests it is practical for edge deployment. Real-world experiments under varying lighting conditions further support its robustness and energy efficiency.

Abstract

Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Event cameras are becoming more widely adopted because they surpass RGB cameras in capturing facial expression changes, owing to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this issue, we propose the CS3D framework, which decomposes the Convolutional 3D (C3D) method to reduce computational complexity and energy consumption. Additionally, soft spiking neurons and a spatial-temporal attention mechanism enhance the model's ability to retain information, thus improving the accuracy of facial expression detection. Experimental results indicate that CS3D attains higher accuracy on multiple datasets than architectures such as the RNN, Transformer, and C3D, while its energy consumption is just 21.97% of that required by the original C3D on the same device.

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of the proposed CS3D framework. The upper row describes the overall architecture of the CS3D. The bottom row illustrates the FactorizedConv3D module and the spatial-temporal joint attention module integrated in the framework. FactorizedConv3D decomposes standard 3D convolutions to reduce the number of parameters and lower the time and space costs of model operation. The spatial-temporal joint attention module integrates temporal and spatial attention, enhancing the model’s ability to capture critical temporal and spatial information in the event stream.
  • Figure 2: Temporal Attention [hu2018squeeze, chen2024ehoa]
  • Figure 3: Spatial Attention [roy2018concurrent, chen2024ehoa]
  • Figure 4: Visualization of the raw event streams and the output results of our CS3D method, demonstrating three emotion tasks: Surprise (SZU-EmoDage), Anger (ADFES), and Others (CASME II).
  • Figure 5: The event camera performs facial expression recognition in a well-lit environment.
  • ...and 1 more figures
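
The Figure 1 caption states that FactorizedConv3D decomposes standard 3D convolutions to reduce the number of parameters. The sketch below illustrates why such a decomposition can shrink a layer, using plain parameter-count arithmetic. It assumes a (2+1)D-style split into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution; the exact factorization used in CS3D is not given in this summary, and the channel sizes and the function names `conv3d_params` and `factorized_conv3d_params` are hypothetical, chosen only for illustration.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=True):
    """Learnable parameters in one 3D convolution layer."""
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def factorized_conv3d_params(c_in, c_mid, c_out, k, bias=True):
    """Parameters in an assumed spatial-then-temporal factorization.

    A 1 x k x k spatial convolution (c_in -> c_mid) is followed by a
    k x 1 x 1 temporal convolution (c_mid -> c_out). This split is an
    assumption for illustration, not the paper's confirmed design.
    """
    spatial = conv3d_params(c_in, c_mid, 1, k, k, bias)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1, bias)
    return spatial + temporal


# Hypothetical layer sizes, chosen only for the example.
c_in, c_out, k = 64, 64, 3

full = conv3d_params(c_in, c_out, k, k, k)
factorized = factorized_conv3d_params(c_in, c_out, c_out, k)
ratio = factorized / full

print(f"full 3D conv:       {full} parameters")
print(f"factorized 3D conv: {factorized} parameters")
print(f"ratio:              {ratio:.2%}")
```

For this layer the factorized version needs fewer than half the parameters of the full k×k×k kernel. This ratio concerns a single layer's parameters only; it is not the source of the 21.97% energy figure reported in the abstract, which is a device-level measurement.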