Towards Student Actions in Classroom Scenes: New Dataset and Baseline
Zhuolin Tan, Chenqiang Gao, Anyong Qin, Ruixin Chen, Tiecheng Song, Feng Yang, Deyu Meng
TL;DR
This work addresses the lack of large-scale, real-world classroom action datasets by introducing the Student Action Video (SAV) dataset, comprising 4,324 clips from 758 classrooms annotated with 15 fine-grained actions in a multi-label, multi-person setting. It proposes a ViT-based baseline augmented with a Local Relation Aggregator (LRA) and Window Enhanced Attention (WEA) to better capture small, densely packed actions in classroom scenes, using Faster R-CNN region proposals and RoI pooling. The SAV benchmark exposes real-world challenges such as subtle movements, density, scale variation, occlusion, and varied viewpoints. The proposed method achieves an mAP of 67.9% on SAV and 27.4% on AVA, outperforming several baselines in dense classroom contexts. The dataset and code release aim to spur AI-driven educational tools that can improve teaching methods and learning outcomes in diverse classrooms.
Abstract
Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets that capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label Student Action Video (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges for advancing action detection methods. To benchmark the dataset, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method achieves a mean Average Precision (mAP) of 67.9% on the SAV dataset and 27.4% on the AVA dataset. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes. The code and dataset are released at https://github.com/Ritatanz/SAV.
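
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean Average Precision can be computed in a multi-label setting. It is not the authors' evaluation code, and the exact protocol used for SAV and AVA is not described in this section. The sketch assumes that each detection has already been matched against the ground truth (for example by an IoU threshold) and flagged as correct or incorrect. The function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(scored: List[Tuple[float, bool]], num_positives: int) -> float:
    """Non-interpolated AP: mean of the precision at each correct detection.

    `scored` holds (confidence, is_correct) pairs for one action class.
    `num_positives` is the number of ground-truth instances of that class.
    """
    if num_positives == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_positives


def mean_average_precision(
    per_class: Dict[str, List[Tuple[float, bool]]],
    positives: Dict[str, int],
) -> float:
    """Average the per-class AP over classes that have ground-truth instances."""
    aps = [
        average_precision(per_class.get(name, []), count)
        for name, count in positives.items()
        if count > 0
    ]
    return sum(aps) / len(aps) if aps else 0.0


# Toy, made-up detections for two hypothetical action classes.
toy_detections = {
    "writing": [(0.9, True), (0.8, False), (0.7, True)],
    "raising_hand": [(0.6, True)],
}
toy_positives = {"writing": 2, "raising_hand": 1}
toy_map = mean_average_precision(toy_detections, toy_positives)
```

Because the average is taken over classes rather than over detections, rare actions weigh as much as frequent ones, which matters for a dataset with 15 action categories of uneven frequency.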
