A Bandit-Based Approach to Educational Recommender Systems: Contextual Thompson Sampling for Learner Skill Gain Optimization

Lukas De Kerpel; Arthur Thuy; Dries F. Benoit

TL;DR

This work addresses scalable personalization of practice in digital OR/MS/Analytics education by recasting exercise sequencing as a contextual bandit problem that optimizes for skill gain. It introduces Linear Thompson Sampling (LinTS) and compares it to non-contextual Thompson Sampling and collaborative filtering baselines, with rewards defined as the change in skill mastery estimated via Bayesian Knowledge Tracing. Using the ASSISTments 2017 dataset, LinTS outperforms all baselines, achieving a final average skill-gain reward of $0.198$ and demonstrating favorable exploration–exploitation dynamics that concentrate recommendations on high-value exercises. The approach offers adaptive, data-driven guidance for instructors, highlighting effective exercises and enabling personalized remediation at scale, with potential extensions to richer context and multi-objective educational goals.
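The reward described above, the change in estimated mastery before and after an exercise, can be illustrated with the standard Bayesian Knowledge Tracing update. This is a minimal sketch under stated assumptions: the names `BKTParams`, `bkt_update`, and `skill_gain` are hypothetical, and the parameter values are illustrative placeholders rather than values fitted in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    # Illustrative values only; these are assumptions, not the paper's fitted parameters.
    p_learn: float = 0.15  # P(T): chance of acquiring the skill at a practice opportunity
    p_slip: float = 0.10   # P(S): skill mastered, but the answer is incorrect
    p_guess: float = 0.20  # P(G): skill not mastered, but the answer is correct


def bkt_update(p_mastery: float, correct: bool, params: BKTParams) -> float:
    """Return the mastery estimate after observing one response (standard BKT)."""
    if correct:
        num = p_mastery * (1 - params.p_slip)
        den = num + (1 - p_mastery) * params.p_guess
    else:
        num = p_mastery * params.p_slip
        den = num + (1 - p_mastery) * (1 - params.p_guess)
    posterior = num / den  # Bayes step: condition on the observed response
    # Transition step: the learner may acquire the skill during this opportunity.
    return posterior + (1 - posterior) * params.p_learn


def skill_gain(p_before: float, correct: bool, params: BKTParams) -> float:
    """Reward sketch: mastery estimate after the exercise minus the estimate before."""
    return bkt_update(p_before, correct, params) - p_before
```

Under these placeholder parameters, a correct answer raises the mastery estimate while an incorrect one can lower it, so a reward defined this way can be negative.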

Abstract

In recent years, instructional practices in Operations Research (OR), Management Science (MS), and Analytics have increasingly shifted toward digital environments, where large and diverse groups of learners make it difficult to provide practice that adapts to individual needs. This paper introduces a method that generates personalized sequences of exercises by selecting, at each step, the exercise most likely to advance a learner's understanding of a targeted skill. The method uses information about the learner and their past performance to guide these choices, and learning progress is measured as the change in estimated skill level before and after each exercise. Using data from an online mathematics tutoring platform, we find that the approach recommends exercises associated with greater skill improvement and adapts effectively to differences across learners. From an instructional perspective, the framework enables personalized practice at scale, highlights exercises with consistently strong learning value, and helps instructors identify learners who may benefit from additional support.
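The step-by-step selection described above can be sketched as a linear Thompson sampling loop. This is a sketch under assumptions, not the authors' implementation: it keeps one independent Bayesian linear model per exercise (a "disjoint" variant), uses an identity prior, and the class name `LinTS`, its method names, and the default exploration scale `v` are chosen for illustration.

```python
import numpy as np


class LinTS:
    """Per-arm linear Thompson sampling (assumed disjoint-model variant)."""

    def __init__(self, n_arms: int, dim: int, v: float = 0.05, seed: int = 0):
        self.v = v  # exploration scale: larger values widen the sampling distribution
        self.rng = np.random.default_rng(seed)
        # B[a] is the regularized design matrix of arm a; f[a] accumulates reward * context.
        self.B = np.stack([np.eye(dim) for _ in range(n_arms)])
        self.f = np.zeros((n_arms, dim))

    def posterior_mean(self, arm: int) -> np.ndarray:
        """Ridge-regression estimate of the arm's weight vector."""
        return np.linalg.solve(self.B[arm], self.f[arm])

    def select(self, x: np.ndarray) -> int:
        """Sample one weight vector per arm and pick the arm with the best sampled reward."""
        scores = []
        for arm in range(len(self.B)):
            cov = np.linalg.inv(self.B[arm])
            cov = (cov + cov.T) / 2  # guard against tiny numerical asymmetry
            theta = self.rng.multivariate_normal(self.posterior_mean(arm), self.v**2 * cov)
            scores.append(float(x @ theta))
        return int(np.argmax(scores))

    def update(self, x: np.ndarray, arm: int, reward: float) -> None:
        """Fold the observed (context, arm, reward) tuple into the chosen arm's model."""
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x
```

In use, each round would call `select` with the learner's context vector, observe the skill-gain reward for the recommended exercise, and pass the tuple to `update`. Arms with few observations keep a wide sampling distribution and so continue to be explored, while well-estimated arms are chosen mainly on their predicted reward.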

Paper Structure (19 sections, 8 equations, 5 figures, 3 tables, 4 algorithms)

Introduction
Related work
Methodology
Multi-Armed Bandits in Educational Recommendation
Collaborative filtering baselines
UserCF
ItemCF
Bandit policies
TS
LinTS
Experimental Setup
Dataset.
Data Preprocessing.
Data splitting.
Algorithms.
...and 4 more sections

Figures (5)

Figure 1: Bandit feedback process in an ERS. The environment (learning platform) emits context $\mathbf{x}_t$; the agent (bandit policy) recommends an exercise $a_t$; after the learner engages, the environment returns reward $r_{t,a_t}$ (skill gain). The resulting tuples $(\mathbf{x}_t,a_t,r_{t,a_t})$ support online learning and evaluation.
Figure 2: Overview of dataset characteristics: (a) distribution of skill-gain rewards, and (b) variability in student activity levels.
Figure 3: Cumulative average reward on the held-out test set. LinTS outperforms all non-contextual baselines, including TS and CF baselines, underscoring the value of contextual modeling in adaptive educational recommendation.
Figure 4: Exercise selection frequency distributions during testing across the four best-performing agents. Contextual modeling (LinTS) concentrates selections on a narrower set of informative exercises, whereas non-contextual strategies spread choices more diffusely across the exercise space.
Figure 5: Exercise selection frequency distributions for LinTS ($v=0.05$) during training. The first subfigure shows early exploration behavior, while the second illustrates later-stage exploitation dynamics.
