A Toolbox for Modelling Engagement with Educational Videos

Yuxiang Qiu, Karim Djemili, Denis Elezi, Aaneel Shalman, María Pérez-Ortiz, Emine Yilmaz, John Shawe-Taylor, Sahan Bulathwela

TL;DR

The paper tackles the scarcity of public data and tools for modelling learner engagement with educational videos by introducing PEEKC, a large in-the-wild dataset that links video fragments to Wikipedia-based Knowledge Components via Wikification, and TrueLearn, an open-source library of online Bayesian learner models with interpretable Open Learner Model visualisations. PEEKC uses ~5-minute video fragments, PageRank and Cosine Similarity to rank KCs, and a 0.75 watch-time threshold to produce binary engagement labels, resulting in a dataset of 290,535 interactions across 20,019 users. The TrueLearn library provides a modular, scalable framework (Datasets, Pre-processing, Models, Learning, Metrics, Visualisations) for online learning, including three models (Interest, Novelty, INK) that integrate learner state factors and offer nine visualisation styles. Empirical results show TrueLearn models frequently outperform content-based and KT baselines, with strong data efficiency and real-time state updates, illustrating the potential for practical AI-driven personalisation of educational videos and enabling researchers to extend the toolkit to broader modalities and platforms.
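
The labelling step mentioned above can be illustrated with a minimal sketch. It assumes that the watch time is normalised by fragment length and that a fragment counts as engaging when the normalised value is at least 0.75; the exact comparison (>= versus >) and every name below (FragmentView, normalised_watch_time, engagement_label) are hypothetical and are not taken from the PEEKC pipeline or the TrueLearn API.

```python
from dataclasses import dataclass

# Threshold quoted in the summary above; how it is applied is an assumption.
ENGAGEMENT_THRESHOLD = 0.75


@dataclass
class FragmentView:
    """One learner-fragment interaction (hypothetical record layout)."""
    learner_id: str
    fragment_id: str
    watched_seconds: float
    fragment_seconds: float


def normalised_watch_time(view: FragmentView) -> float:
    """Fraction of the fragment that was watched, capped at 1.0."""
    if view.fragment_seconds <= 0:
        raise ValueError("fragment_seconds must be positive")
    return min(view.watched_seconds / view.fragment_seconds, 1.0)


def engagement_label(view: FragmentView,
                     threshold: float = ENGAGEMENT_THRESHOLD) -> int:
    """Return 1 if the learner is treated as engaged, else 0 (assumes >=)."""
    return int(normalised_watch_time(view) >= threshold)


# Three made-up views of a 5-minute (300 s) fragment.
views = [
    FragmentView("learner_a", "video_1_frag_0", 240.0, 300.0),
    FragmentView("learner_a", "video_1_frag_1", 150.0, 300.0),
    FragmentView("learner_b", "video_1_frag_0", 225.0, 300.0),
]
labels = [engagement_label(v) for v in views]
```

Keeping the threshold as a parameter makes it easy to check how sensitive the label distribution is to the 0.75 cut-off.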

Abstract

With the advancement and utility of Artificial Intelligence (AI), personalising education to a global population could be a cornerstone of new educational systems in the future. This work presents the PEEKC dataset and the TrueLearn Python library, which contains a dataset and a series of online learner state models that are essential to facilitate research on learner engagement modelling. The TrueLearn family of models was designed following the "open learner" concept, using humanly-intuitive user representations. This family of scalable, online models also helps end-users visualise the learner models, which may in the future facilitate user interaction with their models/recommenders. The extensive documentation and coding examples make the library highly accessible to both machine learning developers and educational data mining and learning analytics practitioners. The experiments show the utility of both the dataset and the library, with predictive performance significantly exceeding that of comparable baseline models. The dataset contains a large number of AI-related educational videos, which are of interest for building and validating AI-specific educational recommenders.

Paper Structure (51 sections, 7 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: (i) Visual representation of the data items available in the PEEKC dataset, where each video is broken into multiple, non-overlapping 5-minute fragments that are linked with ranked Wikipedia-based KCs, and (ii) the flow chart showing how the video data and the learner interaction logs from the VLN repository are processed to create the PEEKC dataset.
  • Figure 2: Characteristics of the PEEKC dataset: (i) number of learners in the training/test dataset based on the number of events in their sessions and (ii) wordcloud depicting the most frequent Wikipedia-based KCs showing the dominance of AI and ML concepts in the dataset.
  • Figure 3: Bubble plot of the 15 KCs for which a learner has acquired the most knowledge. The size of each circle corresponds to the KC mean, and the intensity of the colour to the variance.
  • Figure 4: Visual illustration of the problem setting where learner $\ell$, with knowledge (that allows them to tackle novel content) $\theta_\texttt{NK}$ and interests $\theta_\texttt{I}$ is watching fragments of educational videos $r_x$ containing different knowledge components $K_{r_x}$ over time $t$.
  • Figure 5: Predictive performance on the PEEKC test data in terms of Precision (left), Recall (middle) and F1-Score (right) for the benchmark models when varying numbers of Knowledge Components (KCs) are used as the content representation. A higher number of topics did not improve the performance of the Cosine and Jaccard models enough to reach that of the TrueLearn Novelty model.
  • ...and 4 more figures