Adaptation of the Multi-Concept Multivariate Elo Rating System to Medical Students Training Data

Erva Nihan Kandemir, Jill-Jenn Vie, Adam Sanchez-Ayte, Olivier Palombi, Franck Ramus

TL;DR

This work evaluates a multi-concept Elo rating framework on the BNE medical training platform to predict student performance and question difficulty in a large, sparse, multi-specialty setting. It introduces guessing behavior, dynamic uncertainty, and multi-knowledge-component extensions to Elo, and demonstrates that Elo achieves comparable predictive accuracy to logistic regression on mock exams while enabling real-time, interpretable knowledge tracing. Initializing Elo with prior-year logistic regression data accelerates early convergence and improves early accuracy, highlighting a practical path for online adaptive learning. The study also discusses data characteristics, limitations, and directions for future work, including forgetting curves and online recommendation strategies.

Abstract

Accurate estimation of question difficulty and prediction of student performance play key roles in optimizing educational instruction and enhancing learning outcomes within digital learning platforms. The Elo rating system is widely used to predict student performance by estimating both question difficulty and student ability, while offering computational efficiency and real-time adaptivity. This paper presents an adaptation of a multi-concept variant of the Elo rating system to data collected by a medical training platform. The platform poses unique challenges: a vast knowledge corpus, substantial inter-concept overlap, a huge question bank with highly sparse user-question interactions, and a highly diverse user population. Our study has two primary objectives: first, to comprehensively evaluate the Elo rating system's capabilities on this real-life data, and second, to tackle the issue of imprecise early-stage estimates when implementing the Elo rating system for online assessments. Our findings suggest that the Elo rating system achieves accuracy comparable to the well-established logistic regression model in predicting final exam outcomes for users within our digital platform. Furthermore, the results show that initializing Elo rating estimates with historical data substantially reduces errors and improves prediction accuracy, especially during the initial phases of student interactions.
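
To make the idea of jointly estimating student ability and question difficulty concrete, here is a minimal Python sketch of a generic Elo-style update with a guessing floor. It is not taken from the paper. The function names, the step size `k`, the `guess` parameter, and the choice to average abilities over a question's specialties are all illustrative assumptions, and the paper's actual update rules may differ.

```python
import math


def expected_correct(ability, difficulty, guess=0.0):
    """Probability of a correct answer under a logistic model.

    `guess` is an assumed floor on the success probability
    (e.g. 0.25 for a four-option multiple-choice question).
    """
    return guess + (1.0 - guess) / (1.0 + math.exp(-(ability - difficulty)))


def elo_update(ability, difficulty, correct, k=0.4, guess=0.0):
    """One online update after a single attempt.

    The step size `k` is a hypothetical constant; the ability moves
    toward the observed outcome and the difficulty moves the other way.
    """
    p = expected_correct(ability, difficulty, guess)
    error = (1.0 if correct else 0.0) - p
    return ability + k * error, difficulty - k * error


def multi_concept_ability(abilities, concepts):
    """Assumed aggregation: mean ability over the question's specialties."""
    return sum(abilities[c] for c in concepts) / len(concepts)


# Example: a question tagged with two (hypothetical) specialties.
abilities = {"cardiology": 1.0, "neurology": -1.0}
theta = multi_concept_ability(abilities, ["cardiology", "neurology"])
new_theta, new_difficulty = elo_update(theta, 0.0, correct=True)
```

Because each update touches only the student and the question involved in one attempt, estimates can be refreshed in real time, which is the property the abstract contrasts with batch-fitted logistic regression.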

Paper Structure (22 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Number of questions that require knowledge on any given number of medical specialties. The questions exhibit a spectrum of dependence on medical specialty knowledge for their solution. While a considerable portion of questions rely on ability in a single medical specialty, many questions require knowledge spanning multiple specialties.
  • Figure 2: Overview of the use of the BNE Platform during the 2020-2021 educational year. The blue bars represent the count of unique users per medical specialty in the data set. The orange bars represent the count of unique questions available in the platform for each specialty. In addition, the overlaid line plot illustrates 'Attempts per Specialty,' the total number of user attempts on questions within each specialty during the educational year 2020-2021.
  • Figure 3: Number of Attempts by Each User and to Each Question across the 31 Medical Specialties. The top left box plot shows the distributions of the number of attempts by each user across the 31 specialties. The bottom left box plot depicts the number of attempts to questions in each specialty. During the mock exam, all students took identical questions, resulting in quasi-uniform numbers of attempts given by students and received by questions (top and bottom right plots).
  • Figure 4: Comparing Logistic Regression and Elo Rating outcomes for question difficulty (left) and user ability across 31 specialties (right) in the same 2020-2021 dataset. Scatter plots illustrate the alignment, with $y=x$ lines for reference. The left plot displays Logistic Regression difficulty estimates on the $x$-axis and Elo Rating estimates on the $y$-axis. On the right, the plot contrasts user ability estimates, with Logistic Regression on the $x$-axis and Elo Rating on the $y$-axis. Sample sizes ($N$) are included in each plot.
  • Figure 5: Comparing Logistic Regression Outcomes. Estimated question difficulty (left) and user ability on each of 31 specialties (right) in the two successive education years (2019-2020 and 2020-2021) using the Logistic Regression model. $y=x$ lines are given for reference.
  • ...and 1 more figures