Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement

Andy Gray, Alma Rahat, Stephen Lindsay, Jen Pearson, Tom Crick

TL;DR

The paper addresses the lack of transparency in educational assessment by using Bayesian Comparative Judgement (BCJ) and its multi-criteria extension (MBCJ) to produce probabilistic, interpretable rankings. By integrating prior information and providing rank posteriors with uncertainty measures, BCJ improves auditability and accountability, while MBCJ decomposes judgements by learning outcome (LO) for more granular insight. The authors demonstrate these approaches on a real UK higher-education dataset, supplemented by questionnaires, workshops, and expert interviews that examine perceived transparency, reliability, and practicality. They find that BCJ enhances consistency and interpretability relative to traditional marking, with MBCJ offering even stronger transparency at the LO level, though feedback mechanisms remain a crucial area for development before broader adoption in high-stakes contexts.

Abstract

Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ's quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.

Paper Structure

This paper contains 22 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: A flow chart depicting the CJ process. We start with a number of items to rank. Then, subject to the budget on how many pairs we can show the assessors, we first select a pair to show. The assessor then picks the winner, and the statistical method in place generates a rank for all the items in light of the new evidence. Once the budget is exhausted, we report the final rank to the assessment owner. The green boxes are the core elements of CJ that vary methodologically between distinct approaches.
  • Figure 2: An illustration of Bayesian updates for a biased coin. The model is a generator of coin-flip outcomes: it produces heads with the probability specified by the bias. Here, the bias is 0.3 (a 30% chance of observing heads); we collected data for 50 simulated coin flips and updated the prior belief to track the posterior density. Before any observations, the horizontal line at 1 depicts the flat prior belief that the bias could be anything. As we collect more data, the posterior density over the bias (a Beta distribution for the Bernoulli bias variable) narrows, i.e. the estimate becomes more confident, with a mode around 0.3; the lighter colours are later estimates of the density. The illustration was inspired by the work of sivia2006data.
  • Figure 3: An example of the rank density for an item $i$ after BCJ, given $6$ items. Here, the item has its highest probability (around 50%) of being ranked $6$, but the average (or expected) rank, shown as a red dashed vertical line, is around 5.37 because it accounts for the uncertainty arising from the paucity of data. BCJ uses the expected ranks (instead of the scores used in CJ) to determine the final ranks.
  • Figure 4: A radar plot depicting an item's expected rank $\mathbb{E}[r]$ ($5.75, 5.25, 5.25$) across three different LOs, derived from component-wise paired comparisons between 10 items. While conferring the same level of transparency for overall rankings as BCJ, this provides a detailed look into the components and how an individual item performs across the LOs. This helps educators identify areas where the candidate may need personalised intervention.
  • Figure 5: A histogram of marks for submissions in different groups: candidates for traditional marking, BCJ and MBCJ, labelled datasets 1, 2 and 3 respectively. The groups have similar distributions over the range between 8 and 15; this is important for a fair comparison between the groups.
  • ...and 10 more figures
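
The coin-flip update described in the Figure 2 caption and the expected rank described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`beta_update`, `posterior_mean`, `expected_rank`), the random seed, and the rank probabilities in `rank_probs` are all hypothetical and were chosen here for illustration; they are not data from the paper.

```python
import random


def beta_update(alpha, beta, flips):
    """Conjugate update of a Beta(alpha, beta) belief about P(heads).

    `flips` is a sequence of 1 (heads) and 0 (tails).
    """
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def expected_rank(rank_probs):
    """Expected rank given P(rank = k) for k = 1..n (probabilities sum to 1)."""
    return sum(k * p for k, p in enumerate(rank_probs, start=1))


# Simulate 50 flips of a coin with bias 0.3 (seed is an arbitrary assumption).
rng = random.Random(0)
flips = [1 if rng.random() < 0.3 else 0 for _ in range(50)]

# Flat prior: Beta(1, 1), i.e. the bias could be anything in [0, 1].
prior_alpha, prior_beta = 1.0, 1.0

# Track how the posterior changes as more flips are observed.
for n in (5, 20, 50):
    a, b = beta_update(prior_alpha, prior_beta, flips[:n])
    print(f"after {n:2d} flips: Beta({a:.0f}, {b:.0f}), mean = {posterior_mean(a, b):.3f}")

# Hypothetical rank density over 6 ranks (made-up numbers, not Figure 3's data).
rank_probs = [0.0, 0.0, 0.05, 0.15, 0.30, 0.50]
print(f"expected rank = {expected_rank(rank_probs):.2f}")
```

The most probable rank in `rank_probs` is 6, yet the expected rank is lower because the remaining probability mass sits on ranks 3 to 5; this is the same effect the Figure 3 caption describes.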