Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition
Yang Wang, Haiyang Mei, Qirui Bao, Ziqi Wei, Mike Zheng Shou, Haizhou Li, Bo Dong, Xin Yang
TL;DR
The paper tackles real-time single-eye emotion recognition in VR/AR settings, where full-face cues are often unavailable and reliance on expensive event cameras hinders practical deployment. It introduces an apprenticeship-inspired multimodality synergistic knowledge distillation (MSKD) framework that trains a multimodal ANN-SNN-hybrid teacher on event and frame data and distills its knowledge into a lightweight frame-based SNN student. Distillation relies on two new losses, Hit Consistency Knowledge Distillation (HCKD) and Temporal Consistency Knowledge Distillation (TCKD), which are combined with the classification loss in the total objective $\mathcal{L}_{total} = (1 - \alpha) \mathcal{L}_{Cls} + \alpha \mathcal{L}_{HCKD} + \alpha \mathcal{L}_{TCKD}$. The approach is validated on SEE and a newly created DSEE dataset, achieving state-of-the-art or near-state-of-the-art accuracy while significantly improving efficiency and reducing reliance on expensive event cameras. The work also contributes DSEE as a scalable single-eye emotion benchmark and targets energy-efficient, real-time emotion recognition on wearable VR/AR devices.
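As a minimal sketch of how the total objective in the TL;DR weights its three terms, the snippet below combines already-computed scalar loss values. The function name `total_loss`, its argument names, and the restriction of `alpha` to the range [0, 1] are assumptions made for illustration; the definitions of the HCKD and TCKD losses themselves are not given in this summary and are not reproduced here.

```python
def total_loss(cls_loss: float, hckd_loss: float, tckd_loss: float, alpha: float) -> float:
    """Weighted sum matching the TL;DR formula:
    L_total = (1 - alpha) * L_Cls + alpha * L_HCKD + alpha * L_TCKD

    The three inputs are assumed to be precomputed scalar loss values.
    """
    # Assumption: alpha is a mixing weight in [0, 1]; the summary does not state its range.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return (1.0 - alpha) * cls_loss + alpha * hckd_loss + alpha * tckd_loss


# Hypothetical loss values, chosen only to illustrate the weighting.
example = total_loss(cls_loss=1.0, hckd_loss=2.0, tckd_loss=3.0, alpha=0.5)
```

With `alpha = 0` the objective reduces to the classification loss alone. As written, the three weights sum to $1 + \alpha$ rather than 1, so increasing $\alpha$ both shifts emphasis toward distillation and raises the overall scale of the loss.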
Abstract
We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks. This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network. The core strength of this approach is its ability to utilize the ample, coarser temporal cues found in conventional frames for effective emotion recognition. Consequently, our method adeptly interprets both temporal and spatial information from the conventional frame domain, eliminating the need for specialized sensing devices, e.g., event-based cameras. The effectiveness of our approach is thoroughly demonstrated using both existing and our compiled single-eye emotion recognition datasets, achieving unparalleled performance in accuracy and efficiency over existing state-of-the-art methods.
