Multimodal Classification of Teaching Activities from University Lecture Recordings

Oscar Sapena, Eva Onaindia

TL;DR

This work tackles the problem of locating specific teaching activities within lengthy university lecture recordings. It proposes a multimodal classifier that jointly leverages audio signals and automated transcription text, using a transformer-based text representation (XLM-RoBERTa) and a Wav2Vec 2.0 audio encoder, fused through BiLSTM streams. The study introduces a labeled taxonomy of teaching activities, builds a Spanish university-lecture dataset (34 recordings, ~3773 minutes), and reports varying success across classes with Miscellaneous achieving the strongest performance (F1 ≈ 0.875) while other activities show more modest scores due to data scarcity and transcription noise. The results demonstrate the feasibility of extracting discourse-type information from multimodal inputs and point to practical benefits for students and instructors by enabling direct access to segments of interest, with future work aimed at improving accuracy and extending language/domain coverage.

Abstract

The worldwide pandemic has greatly changed how online higher education is understood. Teaching is undertaken remotely, and the faculty incorporate lecture audio recordings as part of the teaching material. This new online teaching-learning setting has strongly affected university classes. While online teaching technology that enriches virtual classrooms has been abundant over the past two years, the same has not occurred in supporting students during online learning. To overcome this limitation, our aim is to work toward enabling students to easily access the part of the lesson recording in which the teacher explains a theoretical concept, solves an exercise, or comments on organizational issues of the course. To that end, we present a multimodal classification algorithm that identifies the type of activity being carried out at any time of the lesson by using a transformer-based language model that exploits features from the audio file and from the automated lecture transcription. The experimental results show that some academic activities are more easily identified from the audio signal, while the text transcription is needed to identify others. All in all, our contribution aims to recognize the academic activities of a teacher during a lesson.

Paper Structure

This paper contains 17 sections, 10 figures, and 7 tables.

Figures (10)

  • Figure S1: Hierarchy of academic labels.
  • Figure S2: Audio sample of a Miscellaneous segment followed by Indistinct Chat.
  • Figure S3: Audio sample of a Digression segment.
  • Figure S4: Audio sample of an Organization segment.
  • Figure S5: Audio sample of an Interaction segment.
  • ...and 5 more figures