CMOSE: Comprehensive Multi-Modality Online Student Engagement Dataset with High-Quality Labels

Chi-hsuan Wu, Shih-yang Liu, Xijie Huang, Xingbo Wang, Rong Zhang, Luca Minciullo, Wong Kai Yiu, Kenny Kwan, Kwang-Ting Cheng

TL;DR

This paper addresses the challenge of detecting online student engagement by introducing the CMOSE dataset with psychology-guided, high-quality labels and a multi-modal collection of visual, audio, and chat data. It proposes MocoRank, a MoCo-inspired training mechanism that handles data imbalance, intra-class variation, and ordinal relationships through a Score Pool and a Multi-Margin Loss that leverages relative comparisons. The approach combines high-level visual features with I3D representations and audio cues, yielding superior performance compared with traditional losses; multi-modality improves robustness, and transferability experiments show CMOSE pretraining enables stronger cross-dataset performance. Overall, the work provides a comprehensive dataset, a novel learning objective, and practical insights for building effective, multi-modal engagement recognition systems in online education.

Abstract

Online learning is a rapidly growing industry. However, a major concern about online learning is whether students are as engaged as they are in face-to-face classes. An engagement recognition system can notify instructors about students' condition and improve the learning experience. Current challenges in engagement detection include poor label quality, extreme data imbalance, and intra-class variety, that is, the variety of behaviors at a given engagement level. To address these problems, we present the CMOSE dataset, which contains a large amount of data from different engagement levels and high-quality labels annotated according to psychological advice. We also propose a training mechanism, MocoRank, to handle the intra-class variety and the ordinal pattern of the engagement classes. MocoRank outperforms prior engagement detection frameworks, achieving a 1.32% increase in overall accuracy and a 5.05% improvement in average accuracy. Further, we demonstrate the effectiveness of multi-modality in engagement detection by combining video features with speech and audio features. The data transferability experiments also indicate that the proposed CMOSE dataset provides superior label quality and behavior diversity.

Paper Structure (29 sections, 13 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: After converting the gallery video into individual clips, we use pre-trained modules to extract visual, audio, and speech features. These features are the input to the model, which predicts the engagement score. The engagement level is then assigned based on pre-defined thresholds.
  • Figure 2: Various behaviors included in the CMOSE dataset, such as nodding, looking down, speaking, and looking away.
  • Figure 3: Model structure and the training mechanism MocoRank. After the model predicts the scores for the batch of videos, the Multi-Margin Loss is calculated by comparing the scores with the triplets in the Score Pool. Next, the model is updated and the same batch of videos is sent to the Momentum Encoder to update the Score Pool. Lastly, part of the model's weights is transferred to the weights of the Momentum Encoder.
  • Figure 4: A comparison of model recall on each class using different losses for training.
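
The captions for Figures 1 and 3 mention two mechanical steps that a short sketch may make more concrete: assigning an engagement level from a predicted score via thresholds, and transferring part of the model's weights to the Momentum Encoder. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names, the threshold values, the number of levels, and the momentum coefficient `m` are all hypothetical, and the weight transfer is written as the exponential moving average used in MoCo, which this page does not confirm as the exact rule used by MocoRank.

```python
from bisect import bisect_right


def score_to_level(score, thresholds=(0.25, 0.5, 0.75)):
    """Map a scalar engagement score to an ordinal level index.

    The thresholds are hypothetical placeholders; the page only says
    that levels are assigned using pre-defined thresholds.
    Returns 0 for scores below the first threshold, up to
    len(thresholds) for scores at or above the last one.
    """
    return bisect_right(thresholds, score)


def momentum_update(model_weights, momentum_weights, m=0.999):
    """MoCo-style exponential moving average of weights (assumed form).

    Each Momentum Encoder weight keeps a fraction m of its old value
    and takes a fraction (1 - m) from the corresponding model weight.
    Weights are plain lists of floats to keep the sketch dependency-free.
    """
    return [
        m * w_mom + (1.0 - m) * w_model
        for w_model, w_mom in zip(model_weights, momentum_weights)
    ]


if __name__ == "__main__":
    # A score of 0.6 falls between the second and third thresholds.
    print(score_to_level(0.6))
    # With m = 0.9, the momentum weights move 10% toward the model weights.
    print(momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9))
```

A value of `m` close to 1 makes the Momentum Encoder change slowly, which in MoCo-style training is meant to keep the stored reference values (here, the Score Pool) consistent across updates.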