CMOSE: Comprehensive Multi-Modality Online Student Engagement Dataset with High-Quality Labels
Chi-hsuan Wu, Shih-yang Liu, Xijie Huang, Xingbo Wang, Rong Zhang, Luca Minciullo, Wong Kai Yiu, Kenny Kwan, Kwang-Ting Cheng
TL;DR
This paper addresses the challenge of detecting online student engagement by introducing the CMOSE dataset with psychology-guided, high-quality labels and a multi-modal collection of visual, audio, and chat data. It proposes MocoRank, a MoCo-inspired training mechanism that handles data imbalance, intra-class variation, and ordinal relationships through a Score Pool and a Multi-Margin Loss that leverages relative comparisons. The approach combines high-level visual features with I3D representations and audio cues, yielding superior performance compared with traditional losses; multi-modality improves robustness, and transferability experiments show CMOSE pretraining enables stronger cross-dataset performance. Overall, the work provides a comprehensive dataset, a novel learning objective, and practical insights for building effective, multi-modal engagement recognition systems in online education.
Abstract
Online learning is a rapidly growing industry. However, a major concern about online learning is whether students are as engaged as they are in face-to-face classes. An engagement recognition system can notify instructors of the students' condition and thereby improve the learning experience. Current challenges in engagement detection include poor label quality, extreme data imbalance, and intra-class variety, that is, the variety of behaviors at a given engagement level. To address these problems, we present the CMOSE dataset, which contains a large amount of data from different engagement levels and high-quality labels annotated according to psychological advice. We also propose a training mechanism, MocoRank, to handle the intra-class variety and the ordinal pattern of the engagement classes. MocoRank outperforms prior engagement detection frameworks, achieving a 1.32% increase in overall accuracy and a 5.05% improvement in average accuracy. Further, we demonstrate the effectiveness of multi-modality in engagement detection by combining video features with speech and audio features. The data transferability experiments also show that the proposed CMOSE dataset provides superior label quality and behavior diversity.
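
The abstract reports both overall accuracy and average accuracy. The short sketch below illustrates why the two can diverge under extreme data imbalance. It assumes that "average accuracy" means the mean of per-class accuracies, which the abstract does not state explicitly; the function names and toy labels are hypothetical and are not taken from the CMOSE dataset.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies (assumed meaning of 'average accuracy')."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical toy labels: 8 samples of a majority engagement level (2)
# and 2 samples of a minority level (0).
y_true = [2] * 8 + [0] * 2
# A degenerate classifier that always predicts the majority level.
y_pred = [2] * 10

print(overall_accuracy(y_true, y_pred))
print(class_averaged_accuracy(y_true, y_pred))
```

In this toy case the majority-only classifier looks reasonable on overall accuracy but poor on the class-averaged metric, which is why a method aimed at imbalance may show a larger gain on the latter.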
