DeepFace-Attention: Multimodal Face Biometrics for Attention Estimation with Application to e-Learning
Roberto Daza, Luis F. Gomez, Julian Fierrez, Aythami Morales, Ruben Tolosana, Javier Ortega-Garcia
TL;DR
This work tackles the challenge of estimating students' attention (cognitive load) in e-learning from webcam video. It introduces DeepFace-Attention, a multimodal framework that combines five CNN-based face-analysis modules to extract both local and global features across multiple temporal windows, and it fuses these cues with score-level fusion and neural networks. The system is evaluated on the mEBAL2 dataset, demonstrating that eye-related cues and facial expressions are the most informative, that longer windows improve discrimination, and that local-feature fusion via neural networks achieves the best binary attention accuracy of 85.92%, outperforming state-of-the-art methods. The approach offers a practical, noninvasive means to monitor attention in online education, enabling adaptive feedback and interventions.
Abstract
This work introduces an innovative method for estimating attention levels (cognitive load) using an ensemble of facial analysis techniques applied to webcam videos. The method is particularly useful in e-learning applications, among others, so we trained, evaluated, and compared our approach on mEBAL2, a public multimodal database acquired in an e-learning environment. mEBAL2 comprises data from 60 users who performed 8 different tasks. The tasks varied in difficulty, leading to changes in the users' cognitive load. Our approach adapts state-of-the-art facial analysis technologies to quantify the users' cognitive load as high or low attention. It uses several behavioral signals and physiological processes related to cognitive load, such as eyeblink, heart rate, facial action units, and head pose, among others. Furthermore, we conduct a study to determine which individual features obtain better results, which combinations are most efficient, how local and global features compare, and how the length of the temporal window affects attention level estimation, among other aspects. We find that global facial features are more appropriate for multimodal systems using score-level fusion, particularly as the temporal window increases. On the other hand, local features are more suitable for fusion through neural network training with score-level fusion approaches. Our method outperforms existing state-of-the-art accuracies on the public mEBAL2 benchmark.
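As a rough illustration of what score-level fusion means here, the sketch below combines per-module attention scores into a single high/low decision using a weighted average and a threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the module names, the assumption that each score is already normalized to [0, 1], the weighted-sum rule, the weights, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
def fuse_scores(module_scores, weights=None):
    """Weighted-sum score-level fusion (illustrative assumption).

    module_scores: dict mapping a module name to a score in [0, 1],
                   where higher means more evidence of high attention.
    weights:       optional dict with one non-negative weight per module;
                   defaults to equal weights.
    """
    if not module_scores:
        raise ValueError("at least one module score is required")
    if weights is None:
        weights = {name: 1.0 for name in module_scores}
    total = sum(weights[name] for name in module_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the total weight so the fused score stays in [0, 1].
    return sum(weights[n] * s for n, s in module_scores.items()) / total


def classify_attention(module_scores, threshold=0.5, weights=None):
    """Map the fused score to a binary 'high' / 'low' attention label."""
    return "high" if fuse_scores(module_scores, weights) >= threshold else "low"


# Hypothetical scores for one temporal window of webcam video.
window_scores = {
    "eyeblink": 0.8,
    "heart_rate": 0.6,
    "action_units": 0.7,
    "head_pose": 0.4,
}

print(fuse_scores(window_scores))
print(classify_attention(window_scores))
```

With equal weights every module contributes the same amount to the decision. A learned fusion, such as the neural network training the abstract mentions for local features, would replace the fixed weighted sum with a model fitted on labeled data.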
