DeepFace-Attention: Multimodal Face Biometrics for Attention Estimation with Application to e-Learning

Roberto Daza, Luis F. Gomez, Julian Fierrez, Aythami Morales, Ruben Tolosana, Javier Ortega-Garcia

TL;DR

This work tackles the challenge of estimating students' attention (cognitive load) in e-learning from webcam video. It introduces DeepFace-Attention, a multimodal framework that combines five CNN-based face-analysis modules to extract both local and global features across multiple temporal windows, and it fuses these cues with score-level fusion and neural networks. The system is evaluated on the mEBAL2 dataset, demonstrating that eye-related cues and facial expressions are the most informative, that longer windows improve discrimination, and that local-feature fusion via neural networks achieves the best binary attention accuracy of 85.92%, outperforming state-of-the-art methods. The approach offers a practical, noninvasive means to monitor attention in online education, enabling adaptive feedback and interventions.

Abstract

This work introduces a method for estimating attention levels (cognitive load) using an ensemble of facial analysis techniques applied to webcam videos. The method is particularly useful in e-learning applications, among others, so we trained, evaluated, and compared our approach on mEBAL2, a public multimodal database acquired in an e-learning environment. mEBAL2 comprises data from 60 users who performed 8 different tasks. The tasks varied in difficulty, leading to changes in the users' cognitive load. Our approach adapts state-of-the-art facial analysis technologies to quantify the users' cognitive load as high or low attention. Several behavioral signals and physiological processes related to cognitive load are used, such as eyeblink, heart rate, facial action units, and head pose. Furthermore, we study which individual features obtain the best results, which combinations are most efficient, how local and global features compare, and how the length of the temporal window affects attention level estimation. We find that global facial features are more appropriate for multimodal systems using score-level fusion, particularly as the temporal window increases. Local features, on the other hand, are more suitable for fusion through neural network training with score-level fusion approaches. Our method outperforms existing state-of-the-art accuracies on the public mEBAL2 benchmark.
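
The abstract refers to score-level fusion of per-module outputs. A minimal sketch of one common form of it is given below, assuming each module already produces a score in [0, 1] and that fusion is a (weighted) mean followed by a fixed threshold. The fusion rule, the 0.5 threshold, the function names, and the example scores are all illustrative assumptions, not the authors' implementation; only the module labels (EB, HP, EAR) come from the figure captions.

```python
def fuse_scores(scores, weights=None):
    """Weighted mean of per-module scores (assumed fusion rule, not the paper's)."""
    names = sorted(scores)
    if weights is None:
        # Default to equal weights, i.e. a plain mean of the scores.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in names) / total


def classify_attention(fused_score, threshold=0.5):
    """Map a fused score to a binary label; the threshold is hypothetical."""
    return "high" if fused_score >= threshold else "low"


# Illustrative scores only; these are not results from the paper.
module_scores = {"EB": 0.8, "HP": 0.6, "EAR": 0.7}
fused = fuse_scores(module_scores)
label = classify_attention(fused)
```

Passing a `weights` dictionary lets more reliable modules count for more in the fused score, which is one simple way to reflect that some cues are more informative than others.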

Paper Structure (24 sections, 5 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Examples of different real students' attention levels during an e-learning session. (Top) High attention image sequence. (Bottom) Low attention image sequence.
  • Figure 2: Probability density function of the attention obtained from the EEG band for the 60 students in the mEBAL2 database (Daza et al., 2024), along with our attention level classification (high, normal, low) and the thresholds used ($\tau_L$, $\tau_H$).
  • Figure 3: Feature extraction from the Landmark Detection module. On the right eye, we show Eye Aspect Ratio (EAR) calculations. We also display the landmarks used to extract the width and height of the nose and head.
  • Figure 4: Block diagram of the proposed multimodal approach for attention estimation (DeepFace-Attention). The dashed line represents the ground truth used for training the SVMs. The two strategies used, global features ($\mathbf{f}_{\mathrm{G}}$) and local features ($\mathbf{f}_{\mathrm{L}}$), are shown. The feature vector from each module is denoted $\mathbf{f}^{y}_{x}$, and the score of each SVM is denoted $s^{y}_{x}$. Here, $x \in \{\mathrm{L}, \mathrm{G}\}$ specifies whether the features are local or global, and $y$ represents the facial feature category, $y \in \{\mathrm{EB}, \mathrm{HP}, \mathrm{EAR}, \ldots\}$. Finally, $s^{\mathrm{F}}$ represents the fusion of scores.
  • Figure 5: Block diagram of the approach based on selection and fusion of global features for attention estimation. The dashed line represents the ground truth used for training the SVM. The global feature vector is denoted $\mathbf{f}^{y}_{\mathrm{G}}$, where $y$ represents the facial feature category, $y \in \{\mathrm{EB}, \mathrm{HP}, \mathrm{EAR}, \ldots\}$. $\mathbf{f}^{\mathrm{S}}_{\mathrm{G}}$ represents the vector of selected global features. Finally, the score obtained from the SVM is denoted $s_{\mathrm{G}}$.
  • ...and 3 more figures
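
The Figure 3 caption mentions Eye Aspect Ratio (EAR) calculations on eye landmarks. The sketch below shows the widely used six-landmark formulation, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), where p1 and p4 are the horizontal eye corners. It is an assumption that the paper uses this exact definition and landmark ordering; the function name and the example coordinates are hypothetical.

```python
import math


def eye_aspect_ratio(eye):
    """Compute EAR from six (x, y) eye landmarks ordered p1..p6.

    Assumed ordering: p1 and p4 are the horizontal corners,
    (p2, p6) and (p3, p5) are the two vertical landmark pairs.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("degenerate eye: horizontal corners coincide")
    return vertical / (2.0 * horizontal)


# Synthetic landmarks for illustration; not taken from mEBAL2.
open_eye = [(0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (3, 0.1), (4, 0), (3, -0.1), (1, -0.1)]

ear_open = eye_aspect_ratio(open_eye)
ear_closed = eye_aspect_ratio(closed_eye)
```

Because the vertical distances are normalized by the eye width, the ratio falls toward zero as the eye closes while staying roughly invariant to face scale, which is why it is a common cue for blink-related features.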