Multimodal Classification of Teaching Activities from University Lecture Recordings
Oscar Sapena, Eva Onaindia
TL;DR
This work tackles the problem of locating specific teaching activities within lengthy university lecture recordings. It proposes a multimodal classifier that jointly leverages audio signals and automated transcription text, using a transformer-based text representation (XLM-RoBERTa) and a Wav2Vec 2.0 audio encoder, fused through BiLSTM streams. The study introduces a labeled taxonomy of teaching activities, builds a Spanish university-lecture dataset (34 recordings, ~3773 minutes), and reports varying success across classes with Miscellaneous achieving the strongest performance (F1 ≈ 0.875) while other activities show more modest scores due to data scarcity and transcription noise. The results demonstrate the feasibility of extracting discourse-type information from multimodal inputs and point to practical benefits for students and instructors by enabling direct access to segments of interest, with future work aimed at improving accuracy and extending language/domain coverage.
Abstract
The worldwide pandemic has greatly changed how online higher education is understood. Teaching is undertaken remotely, and faculty incorporate lecture audio recordings as part of the teaching material. This new online teaching-learning setting has had a large impact on university classes. While technology that enriches virtual classrooms for online teaching has been abundant over the past two years, comparable support for students during online learning has not followed. To overcome this limitation, our aim is to work toward enabling students to easily access the part of a lesson recording in which the teacher explains a theoretical concept, solves an exercise, or comments on organizational issues of the course. To that end, we present a multimodal classification algorithm that identifies the type of activity being carried out at any time of the lesson, using a transformer-based language model that exploits features from the audio file and from the automated lecture transcription. The experimental results show that some academic activities are more easily identified from the audio signal, while others require the text transcription. All in all, our contribution aims to recognize the academic activities of a teacher during a lesson.
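
To illustrate how per-window activity predictions could give students direct access to the part of a recording they want, the following is a minimal sketch in Python. It is not the authors' implementation: the label names, the 30-second window length, and the functions `merge_windows` and `find_segments` are assumptions introduced only for this example. It merges consecutive windows that share a predicted label into time segments and then looks up the segments for one activity.

```python
from itertools import groupby


def merge_windows(labels, window_seconds=30.0):
    """Merge consecutive windows that share a label into (label, start, end) segments.

    `labels` holds one predicted activity per fixed-length window, in time order.
    The window length is a hypothetical value chosen for illustration.
    """
    segments = []
    index = 0  # index of the first window in the current run
    for label, run in groupby(labels):
        length = len(list(run))
        start = index * window_seconds
        end = (index + length) * window_seconds
        segments.append((label, start, end))
        index += length
    return segments


def find_segments(segments, wanted):
    """Return the (start, end) times, in seconds, of every segment labeled `wanted`."""
    return [(start, end) for label, start, end in segments if label == wanted]


# Hypothetical classifier output for seven consecutive windows.
predicted = [
    "organization", "theory", "theory",
    "exercise", "exercise", "exercise", "theory",
]
segments = merge_windows(predicted, window_seconds=30.0)
theory_spans = find_segments(segments, "theory")
```

With such an index, a player could jump straight to the start time of each span in `theory_spans` instead of requiring the student to scan the whole recording.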
