CS3D: An Efficient Facial Expression Recognition via Event Vision

Zhe Wang; Qijin Song; Yucen Peng; Weibang Bai

TL;DR

The paper tackles energy-efficient facial expression recognition (FER) using event cameras for edge robots. It introduces CS3D, a compact spatial-temporal network with factorized 3D convolutions, soft spiking neurons, and a spatial-temporal attention module, validated on event-converted FER datasets. CS3D achieves higher accuracy than RNN, Transformer, and C3D baselines while consuming only about 22% of C3D's energy on a Titan X, which suggests it is practical for edge deployment. Real-world experiments under varying lighting conditions further support its robustness and energy efficiency.

Abstract

Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Event cameras are becoming more widely adopted because they surpass RGB cameras in capturing facial expression changes, owing to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, which makes them difficult to deploy on edge computing devices and increases costs, especially for high-frequency, dynamic, event vision-based approaches. To address this issue, we propose the CS3D framework, which decomposes the Convolutional 3D (C3D) method to reduce computational complexity and energy consumption. Additionally, soft spiking neurons and a spatial-temporal attention mechanism enhance the network's ability to retain information, improving the accuracy of facial expression detection. Experimental results indicate that CS3D attains higher accuracy on multiple datasets than architectures such as the RNN, Transformer, and C3D, while its energy consumption is just 21.97% of that required by the original C3D on the same device.
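
To illustrate why decomposing a 3D convolution can reduce cost, the following minimal sketch counts weights for a full 3D kernel and for one common factorization: a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution. This particular split, the channel width of 64, the kernel size of 3, and the function names are assumptions chosen for illustration; the abstract does not specify how CS3D decomposes C3D, and the weight ratio computed here is not the 21.97% energy figure reported above.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution (bias terms ignored)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_out, k, c_mid=None):
    """Weights of a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv.

    c_mid is the (hypothetical) number of intermediate channels; it defaults
    to c_out when not given.
    """
    c_mid = c_out if c_mid is None else c_mid
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1)
    return spatial + temporal


# Illustrative layer: 64 input channels, 64 output channels, kernel size 3.
full = conv3d_params(64, 64, 3, 3, 3)
fact = factorized_params(64, 64, 3)
ratio = fact / full

print(f"full 3D conv weights:    {full}")
print(f"factorized conv weights: {fact}")
print(f"ratio: {ratio:.3f}")
```

Under these assumptions, the factorized layer uses k^2 + k weights per channel pair instead of k^3, so the saving grows with kernel size. Actual energy consumption also depends on activations, memory traffic, and hardware, which this count does not model.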
