Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition

Yang Wang, Haiyang Mei, Qirui Bao, Ziqi Wei, Mike Zheng Shou, Haizhou Li, Bo Dong, Xin Yang

TL;DR

The paper tackles real-time single-eye emotion recognition in VR/AR settings, where full-face cues are often unavailable and the cost of event cameras limits practical deployment. It introduces an apprenticeship-inspired multimodality synergistic knowledge distillation (MSKD) framework: a multimodal ANN-SNN-hybrid teacher is trained on event and frame data, and its knowledge is distilled into a lightweight frame-based SNN student. Distillation uses two new losses, Hit Consistency Knowledge Distillation (HCKD) and Temporal Consistency Knowledge Distillation (TCKD), combined with the classification loss in the total objective $\mathcal{L}_{total} = (1 - \alpha) \mathcal{L}_{Cls} + \alpha \mathcal{L}_{HCKD} + \alpha \mathcal{L}_{TCKD}$. The approach is validated on SEE and a newly created DSEE dataset, achieving state-of-the-art or near-state-of-the-art accuracy while improving efficiency and removing the need for event cameras at inference time. The work also contributes a scalable single-eye emotion benchmark (DSEE) and targets energy-efficient, real-time emotion recognition on wearable VR/AR devices.
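As a minimal sketch of how the total objective above combines its three terms, the snippet below evaluates $(1 - \alpha) \mathcal{L}_{Cls} + \alpha \mathcal{L}_{HCKD} + \alpha \mathcal{L}_{TCKD}$ for loss values that are already computed. The function name `total_objective`, the plain-float inputs, and the restriction of `alpha` to the range [0, 1] are assumptions made for illustration; this summary does not define how the HCKD and TCKD terms themselves are calculated, so they are treated as given numbers here.

```python
def total_objective(loss_cls: float, loss_hckd: float, loss_tckd: float,
                    alpha: float) -> float:
    """Weighted sum (1 - alpha) * L_Cls + alpha * L_HCKD + alpha * L_TCKD.

    The three loss values are assumed to be precomputed scalars; how the
    distillation terms are obtained is outside the scope of this sketch.
    """
    # Assumption: alpha is a balancing weight in [0, 1].
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return (1.0 - alpha) * loss_cls + alpha * loss_hckd + alpha * loss_tckd


# Hypothetical loss values, chosen only to make the arithmetic easy to follow.
example = total_objective(loss_cls=2.0, loss_hckd=1.0, loss_tckd=0.5, alpha=0.5)
print(example)  # 0.5 * 2.0 + 0.5 * 1.0 + 0.5 * 0.5 = 1.75
```

Note that both distillation terms share the same weight $\alpha$, so the three weights sum to $1 + \alpha$ rather than 1. With $\alpha = 0$ the objective reduces to the classification loss alone.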

Abstract

We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks. This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network. The core strength of this approach is its ability to utilize the ample, coarser temporal cues found in conventional frames for effective emotion recognition. Consequently, our method adeptly interprets both temporal and spatial information from the conventional frame domain, eliminating the need for specialized sensing devices such as event-based cameras. The effectiveness of our approach is thoroughly demonstrated using both existing and our compiled single-eye emotion recognition datasets, achieving unparalleled performance in accuracy and efficiency over existing state-of-the-art methods.

Paper Structure (20 sections, 5 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Surpassing the state-of-the-art SEEN in real-time single-eye emotion recognition (SER), our method achieves lightweight inference by requiring solely intensity frames, obviating the need for data from expensive event cameras. This is facilitated by our novel synergistic knowledge distillation strategy, enabling real-time and accurate SER on resource-constrained devices for the first time.
  • Figure 2: Overview of our proposed multimodality synergistic knowledge distillation (MSKD) framework: (a) the framework, which consists of a multimodal-input ANN-SNN-hybrid teacher network (top) and a unimodal-input SNN student network (bottom), together with two synergistic knowledge distillation loss terms: (b) the hit consistency loss and (c) the temporal consistency loss.
  • Figure 3: Examples from our DSEE dataset.
  • Figure 4: Illustration of the single-eye events data synthesis process.