Real-time estimation of overt attention from dynamic features of the face using deep-learning

Aimar Silvan Ortubay; Lucas C. Parra; Jens Madsen

TL;DR

This work uses Inter-Subject Correlation (ISC) of eye movements as an index of attention and trains an AI to predict attention from a single subject’s facial dynamics alone, removing the need for a reference group or human labeling of attention.

Abstract

Students often drift in and out of focus during class. Effective teachers recognize this and re-engage them when necessary. With the shift to remote learning, teachers have lost the visual feedback needed to adapt to varying student engagement. We propose using readily available front-facing video to infer attention levels based on movements of the eyes, head, and face. We train a deep learning model to predict a measure of attention based on overt eye movements. Specifically, we measure Inter-Subject Correlation of eye movements in ten-second intervals while students watch the same educational videos. In 3 different experiments (N=83) we show that the trained model predicts this objective metric of attention on unseen data with $R^2$=0.38, and on unseen subjects with $R^2$=0.26-0.30. The deep network relies mostly on a student's eye movements, but to some extent also on movements of the brows, cheeks, and head. In contrast to Inter-Subject Correlation of the eyes, the model can estimate attentional engagement from individual students' movements without needing reference data from an attentive group. This enables a much broader set of online applications. The solution is lightweight and can operate on the client side, which mitigates some of the privacy concerns associated with online attention monitoring. GitHub implementation is available at https://github.com/asortubay/timeISC
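As a rough illustration of the training target described above, the sketch below computes a time-resolved Inter-Subject Correlation for one student in ten-second windows. It is not the authors' implementation. It assumes a leave-one-out formulation (the student's trace is correlated with the mean trace of all other viewers), a single gaze coordinate per sample, a one-second step between windows, and a hypothetical sampling rate `fs`; all function and parameter names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def time_resolved_isc(traces, subject, fs, window_s=10, step_s=1):
    """ISC of one subject against the mean of all other subjects, per sliding window.

    traces: dict mapping subject id -> list of samples of one gaze coordinate,
            all of the same length and recorded on the same video (assumption).
    fs:     samples per second (hypothetical parameter).
    """
    others = [t for s, t in traces.items() if s != subject]
    if not others:
        raise ValueError("need at least one other subject as reference")
    own = traces[subject]
    n = len(own)
    # Leave-one-out reference: sample-wise mean over the remaining subjects.
    reference = [sum(t[i] for t in others) / len(others) for i in range(n)]
    win, step = int(window_s * fs), int(step_s * fs)
    return [
        pearson(own[start:start + win], reference[start:start + win])
        for start in range(0, n - win + 1, step)
    ]
```

Each returned value would serve as one regression target, paired with the same ten seconds of that student's face features. The need for `traces` from other viewers is exactly what the trained model is meant to remove at inference time.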

Paper Structure (8 sections, 2 equations, 3 figures, 3 tables)

Introduction
Methods
Datasets
Face tracking and landmark normalization
Time-resolved Inter-Subject Correlation
Deep-learning estimation of time-resolved ISC
Results & Discussion
Conclusion

Figures (3)

Figure 1: Canonical face showing the vertices and 478 landmarks tracked by MediaPipe FaceMesh. Colored points mark the landmarks used in the affine transformation (magenta) and the tracked iris (cyan).
Figure 2: Engagement prediction framework: 10 seconds of dynamic face features predict a single Inter-Subject Correlation (ISC) value. A: Iris movements tracked by FaceMesh exhibit ISC similar to that of gaze position captured by a research-grade eye tracker ($r(24)=0.89$, $p=1.2\times10^{-9}$); data from Exp. 1 (N=26). B: 10-second windows of FaceMesh iris movements are used to compute time-resolved ISC in one-second steps. Two attentive subjects (S1 and S2) show similar iris movements and have higher instantaneous and average ISC than a non-attentive subject (S3). C: Example of a single prediction: the model learns to predict an ISC value from the preceding 10 seconds of dynamic face and head movements (64 features) of a single student. The best-performing model has a series of spatiotemporal convolutions, ReLU activations, batch normalization, and max-pooling layers, followed by an LSTM and fully connected layers.
Figure 3: The attention prediction relies heavily on eye movements and, to a lesser extent, on head, eyebrow, cheek, and mouth movements. Percent change in Mean Absolute Error per participant in the feature suppression study: for Exp. 2 & 3 participants (df=N=57), model errors increase significantly (***$p<0.001$, *$p<0.05$, paired-tailed $t$-test) after zeroing out specific predictor features, compared with the unchanged input ('none').
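The feature suppression analysis summarized in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `model` is any hypothetical callable mapping one window (a list of frames, each a list of feature values) to a predicted ISC, `feature_idx` is a hypothetical set of column indices for one feature group (for example, the eye features), and the percent change is taken relative to the unsuppressed error, which is assumed to be nonzero.

```python
def mean_abs_error(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def suppress(window, feature_idx):
    """Return a copy of a (time x features) window with the given columns zeroed."""
    cols = set(feature_idx)
    return [[0.0 if j in cols else v for j, v in enumerate(frame)] for frame in window]


def pct_change_mae(model, windows, targets, feature_idx):
    """Percent change in MAE after zeroing one feature group.

    Assumes the baseline MAE is nonzero; the relative-change formula is an
    illustrative choice, not taken from the paper.
    """
    base = mean_abs_error([model(w) for w in windows], targets)
    ablated = mean_abs_error(
        [model(suppress(w, feature_idx)) for w in windows], targets
    )
    return 100.0 * (ablated - base) / base
```

Running `pct_change_mae` once per participant and per feature group would give the kind of per-participant values that a paired test can then compare against the unchanged condition.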
