Scaling Success: A Systematic Review of Peer Grading Strategies for Accuracy, Efficiency, and Learning in Contemporary Education
Uchswas Paul, Ananya Mantravadi, Jash Shah, Shail Shah, Sri Vaishnavi Mylavarapu, M Parvez Rashid, Edward Gehringer
TL;DR
This study addresses the scalability and reliability of peer grading in large and online courses through a systematic review of 122 peer-reviewed studies published between 1981 and 2024. It introduces a two-dimensional taxonomy, consisting of Evaluation Approaches (Summative vs. Formative) and Reviewer Weighting Strategies (Calibration, Back-Evaluation, Consensus, Reputation), to classify peer-grading systems and analyze their effects on accuracy, fairness, workload, and learning. The findings show that ranking and categorization often yield more reliable results than raw ratings. Consensus, calibration, and reputation weighting can each improve accuracy, but no single method universally excels; hybrid, staged approaches appear most promising. The paper also highlights a gap in scalable evaluation of formative feedback and calls for further work on long-term learning effects and transparent weighting mechanisms. The resulting framework gives educators and researchers a practical basis for designing more accurate, equitable, and pedagogically meaningful peer grading systems.
Abstract
Peer grading has emerged as a scalable solution for assessment in large and online classrooms, offering both logistical efficiency and pedagogical value. However, designing effective peer-grading systems remains challenging due to persistent concerns around accuracy, fairness, reliability, and student engagement. This paper presents a systematic review of 122 peer-reviewed studies on peer grading spanning more than four decades. Drawing from this literature, we propose a comprehensive taxonomy that organizes peer grading systems along two key dimensions: (1) evaluation approaches and (2) reviewer weighting strategies. We analyze how different design choices impact grading accuracy, fairness, student workload, and learning outcomes. Our findings highlight the strengths and limitations of each method. Notably, we found that formative feedback, often regarded as the most valuable aspect of peer assessment, is seldom incorporated as a quality-based weighting factor in summative grade synthesis techniques. Furthermore, no single reviewer weighting strategy proves universally optimal; each has its trade-offs. Hybrid strategies that combine multiple techniques appear to hold the greatest promise. Our taxonomy offers a practical framework for educators and researchers aiming to design peer grading systems that are accurate, equitable, and pedagogically meaningful.
