Towards Student Actions in Classroom Scenes: New Dataset and Baseline

Zhuolin Tan; Chenqiang Gao; Anyong Qin; Ruixin Chen; Tiecheng Song; Feng Yang; Deyu Meng

TL;DR

This work addresses the lack of large-scale, real-world classroom action datasets by introducing the Student Action Video (SAV) dataset, comprising 4,324 clips from 758 classrooms annotated with 15 fine-grained actions in a multi-label, multi-person setting. It proposes a ViT-based baseline augmented with a Local Relation Aggregator (LRA) and Window Enhanced Attention (WEA) to better capture small, densely packed actions in classroom scenes, using Faster R-CNN region proposals and RoI pooling. The SAV benchmark exposes real-world challenges such as subtle movements, dense scenes, scale variation, occlusion, and varied viewpoints. The proposed method achieves an mAP of 67.9% on SAV and 27.4% on AVA, outperforming several baselines in dense classroom contexts. The dataset and code are released to spur AI-driven educational tools that can enhance teaching methods and learning outcomes in diverse classrooms.

Abstract

Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label Student Action Video (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes. The code and dataset are released at https://github.com/Ritatanz/SAV.
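The mAP figures quoted above average a per-class Average Precision over all action classes. The following is a minimal sketch of that averaging, assuming the non-interpolated definition of AP and assuming each prediction has already been marked as correct or incorrect. The function names `average_precision` and `mean_average_precision` are illustrative, and the paper's exact evaluation protocol (for example, how detected boxes are matched to ground truth) is not stated in this summary and is not reproduced here.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: confidence per prediction (higher = more confident).
    labels: 1 if the prediction is a true positive, else 0.
    """
    # Rank predictions from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            # Precision measured at the rank of each true positive.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Mean of per-class AP values (one score/label list per class)."""
    aps = [
        average_precision(s, l)
        for s, l in zip(scores_per_class, labels_per_class)
    ]
    return sum(aps) / len(aps)


# Toy example with two hypothetical classes.
toy_scores = [[0.9, 0.8, 0.7, 0.6], [0.2, 0.9]]
toy_labels = [[1, 0, 1, 0], [1, 0]]
print(mean_average_precision(toy_scores, toy_labels))
```

Because every class contributes equally to the mean, rare actions weigh as much as frequent ones, which is one reason class imbalance matters for a multi-label benchmark.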

Paper Structure (17 sections, 10 equations, 15 figures, 6 tables)

Introduction
Related work
Action recognition and detection datasets
Video action detection
The Student Action Video Dataset
Data collection
Data statistics
Data characteristics
Action detection model
Local Relation Aggregator
Window Enhanced Attention
Experiments
Datasets and Metrics
Action Detection Results
Ablation study
...and 2 more sections

Figures (15)

Figure 1: The bounding boxes and action annotations in sample frames of our dataset. Each frame is cropped for zooming in to show the annotations better. Each person has a postural action (in orange), a sight action (in green), person-object interactions (in blue), body-motion actions (in purple), and person-person interactions (in red) annotated when they occur. Note that only keyframes are shown here, and accurate annotation of actions requires temporal context.
Figure 2: Each row shows samples from J-HMDB [jhuang2013towards], UCF101-24 [soomro2012ucf101], AVA [gu2018ava], MultiSports [li2021multisports], and SAV, respectively. (a) J-HMDB: each video contains one person with a single label. (b) UCF101-24: same as above. (c) AVA: contains multiple persons with multiple labels. (d) MultiSports: contains multiple persons, each with a sports label. (e) SAV: contains densely packed persons with multiple labels. (Note: the data from Sun et al. [sun2021student] are not yet publicly available.)
Figure 3: Statistics of each class in the SAV dataset, sorted in descending order. Blue for pose actions, green for sight actions, orange for person-object interactions, purple for body-motion actions, and light purple for person-person interactions.
Figure 4: Comparison of the characteristics of bounding boxes. First row: the X-axis shows the ratio of bounding box area to video frame area; the Y-axis shows the normalized density of bounding box occurrences. Second row: the X-axis shows the aspect ratio of the bounding box (height/width); the Y-axis shows the normalized density of bounding box occurrences.
Figure 5: First row: the different educational stages of classrooms in SAV: kindergarten, elementary school, and middle school. Second row: the different course scenarios in SAV, such as math, chemistry, and physics.
...and 10 more figures
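
The two quantities plotted in Figure 4 can be computed directly from a box and the frame size. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) pixel corners. The function name `box_stats` and this coordinate convention are illustrative assumptions and are not taken from the SAV code release.

```python
def box_stats(box, frame_w, frame_h):
    """Return (area_ratio, aspect_ratio) for one bounding box.

    box: (x1, y1, x2, y2) in pixels (assumed convention).
    area_ratio: box area divided by frame area.
    aspect_ratio: box height divided by box width, as in Figure 4.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    area_ratio = (w * h) / (frame_w * frame_h)
    aspect_ratio = h / w
    return area_ratio, aspect_ratio


# A hypothetical 192x108 box in a 1920x1080 frame covers about 1% of it.
area, aspect = box_stats((0, 0, 192, 108), 1920, 1080)
print(area, aspect)
```

Collecting these two values over all annotated boxes and plotting their normalized histograms would give distributions of the kind Figure 4 compares across datasets.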
