Transformer-Driven Modeling of Variable Frequency Features for Classifying Student Engagement in Online Learning

Sandeep Mandia, Kuldeep Singh, Rajendra Mitharwal, Faisel Mushtaq, Dimpal Janu

TL;DR

The paper tackles automatic student engagement classification in online learning, a task intensified by the shift to remote education. It introduces EngageFormer, a three-view transformer architecture that processes video via per-view encoders, cross-view attention fusion, a sequence pooling mechanism, and a global encoder that fuses information before final MLP classification. The method is validated on DAiSEE, BAUM-1, YawDD, UTA-RLDD, and a curated learning-centered affective state dataset. It achieves state-of-the-art accuracy on DAiSEE (63.9%), BAUM-1 (56.73%), and YawDD (99.16%), reaches 74.89% on the curated set, and sets a 65.67% baseline on UTA-RLDD. The work demonstrates effective modeling of slowly and rapidly varying temporal features in video and provides a foundation for adaptive online-learning interventions and future research with larger, more diverse datasets.

Abstract

The COVID-19 pandemic and the internet's availability have recently boosted online learning. However, monitoring engagement in online learning is a difficult task for teachers. In this context, timely automatic student engagement classification can help teachers make adaptive adjustments to meet students' needs. This paper proposes EngageFormer, a transformer-based architecture with sequence pooling that uses the video modality for engagement classification. The proposed architecture computes three views from the input video and processes them in parallel using transformer encoders; the global encoder then processes the representation from each encoder, and finally, a multilayer perceptron (MLP) predicts the engagement level. A learning-centered affective state dataset is curated from existing open-source databases. The proposed method achieved an accuracy of 63.9%, 56.73%, 99.16%, 65.67%, and 74.89% on the Dataset for Affective States in E-Environments (DAiSEE), the Bahcesehir University Multimodal Affective Database-1 (BAUM-1), the Yawning Detection Dataset (YawDD), the University of Texas at Arlington Real-Life Drowsiness Dataset (UTA-RLDD), and the curated learning-centered affective state dataset, respectively. The results on the BAUM-1, DAiSEE, and YawDD datasets demonstrate state-of-the-art performance, indicating the superiority of the proposed model in accurately classifying affective states on these datasets. Additionally, the results obtained on the UTA-RLDD dataset, which involves two-class classification, serve as a baseline for future research. These results provide a foundation for further investigations and a point of reference for future works to compare against and improve upon.
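
The abstract names sequence pooling without giving its formulation on this page. As a rough illustration only, the sketch below assumes the attention-style pooling commonly used in compact transformers: each token gets a scalar score, the scores are softmax-normalized over the sequence, and the tokens are averaged with those weights. The function names `sequence_pool` and `pool_views` and the scoring vector `w` are hypothetical and may not match the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, w):
    """Collapse n token vectors (each of length d) into one vector of length d.

    tokens: list of n lists of floats (encoder outputs for one view).
    w: list of d floats, a hypothetical learned scoring vector (assumption).
    """
    # One scalar importance score per token.
    scores = [sum(t * wi for t, wi in zip(tok, w)) for tok in tokens]
    # Normalize the scores over the sequence dimension.
    weights = softmax(scores)
    d = len(tokens[0])
    # Weighted sum of the tokens.
    return [
        sum(weights[k] * tokens[k][j] for k in range(len(tokens)))
        for j in range(d)
    ]


def pool_views(view_tokens, w):
    """Pool each view separately; the result is one vector per view,
    which a global encoder could then process (assumed wiring)."""
    return [sequence_pool(tokens, w) for tokens in view_tokens]
```

With an all-zero `w`, every token receives the same weight and the pooling reduces to a plain mean over the sequence; a trained `w` would instead let the model emphasize the most informative frames or patches.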

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Proposed student engagement classification methodology, where $L$ is the number of encoder layers and $T, W, H, W$ are the dimensions of the pre-processed video.
  • Figure 2: Proposed transformer architecture
  • Figure 3: Encoder with cross-view attention fusion
  • Figure 4: Confusion matrix for classification on (a) BAUM-1s dataset and (b) YawDD dataset
  • Figure 5: Confusion matrix for (a) DAiSEE dataset and (b) learning-centered affective state dataset