S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Tasfia Seuti; Sagnik Ray Choudhury

TL;DR

This work introduces S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols, enabling continuous integration of new datasets and evaluation settings.

Abstract

Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

Paper Structure (31 sections, 7 equations, 13 figures, 4 tables)

Introduction
Related Work
S-GRADES Benchmark Datasets
Benchmark Development
Preprocessing & Standardization
Platform Architecture
Evaluation Metrics
Experimental Setup
Results and Discussion
Prediction Stability
Exemplar Selection Stability
Exemplar Generalization
Conclusion and Future Work
Limitations
Comprehensive Reasoning Strategy Analysis
...and 16 more sections

Figures (13)

Figure 1: Overview of the S-GRADES benchmark user workflow.
Figure 2: Complete benchmark submission interface.
Figure 3: Public leaderboard displaying aggregated results across all datasets and evaluation metrics.
Figure 4: QWK scores for AES datasets across models and reasoning strategies.
Figure 5: QWK scores for ASAG regression datasets across models and reasoning strategies.
...and 8 more figures
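
Figures 4 and 5 report QWK (quadratic weighted kappa) without defining it here. As a rough illustration of what that metric measures, the following is a minimal sketch of the standard QWK formula for integer ratings. The function name `quadratic_weighted_kappa` and its parameters are hypothetical, and the benchmark's actual scoring code, rating ranges, and any rounding applied to regression outputs are not described in this excerpt.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Standard quadratic weighted kappa for integer ratings.

    Illustrative sketch only; not taken from the S-GRADES codebase.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating categories")

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: larger score gaps are penalized more heavily.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if the two raters were statistically independent.
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used one identical category throughout; treating this
        # as perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement with the reference scores, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.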
