Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Michael Hardy

TL;DR

We quantitatively illustrate that the difficulty human experts have in scoring children's written work has no observed statistical effect on LLM performance, and that some scoring tasks measured as the easiest for human scorers were the hardest for LLMs.

Abstract

Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results from a systematic review of LLM short-answer scoring studies, modeling the traditional effect size, Quadratic Weighted Kappa (QWK), with mixed-effects meta-regression. We quantitatively illustrate that the difficulty human experts have in scoring children's written work has no observed statistical effect on LLM performance. In particular, we show that some scoring tasks measured as the easiest for human scorers were the hardest for LLMs. Whether because of poor implementation by thoughtful researchers or because of patterns traceable to autoregressive training, decoder-only architectures underperform encoders by 0.37 on average, a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology to successful scoring, such as tokenizer vocabulary size, which exhibits diminishing returns, potentially due to undertrained tokens. These findings argue for system designs that better anticipate the known statistical shortcomings of autoregressive models. Finally, we provide additional experiments that illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.

Paper Structure (53 sections, 2 equations, 3 figures, 10 tables)

Introduction and motivation
Background
Sensitivity to Tokenization and Wording
Scoring of Student Work
LLM responses matter: Biases in LLMs
Data
Methods
Meta-analytic Study
QWK outcome
Effect size
Predictors
Hierarchical meta-regression
Multilevel meta-modeling
Baseline and increasingly controlled specifications
Bayesian estimation of the maximal stable model
...and 38 more sections

Figures (3)

Figure 1: LM benchmarks over time, from kiela_plotting_2023. Short-answer scoring results for K-12 student writing on the ASAP-SAS dataset, in purple, are added using the same scaling: human performance is set to 0 and baseline performance to -1, and each result $X$ is scaled as $(X-\text{Human})/|\text{Baseline}-\text{Human}|$. Human and 2012 baseline results are from shermis_contrasting_2015. Point (A) is the best LLM result, a transformer ensemble (ormerod_short-answer_2022). Point (B) is the best GPT result, a fine-tuned ensemble (ormerod_automated_2024). Point (C), not plotted, is the best implementation using GPT-4 with prompt engineering (jiang_short_2024) and would have a value of -1.52 on this scale. Although the plot ends in 2025, to the best of our knowledge this chart still reflects the state of the art for this task.
Figure 2: Khanmigo IEP Assistant Entry Page
Figure 3: QWK distribution of LLMs on ASAP-SAS dataset for meta-analysis models relative to human performance
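
The scaling described in the Figure 1 caption can be illustrated with a minimal sketch. The function name `scale_to_human` and the numeric inputs are hypothetical, chosen only to show the arithmetic; they are not the human, baseline, or model values used in the paper.

```python
def scale_to_human(result: float, human: float, baseline: float) -> float:
    """Rescale a score as (X - Human) / |Baseline - Human|.

    When the baseline sits below human performance, this maps
    human performance to 0 and the baseline to -1.
    """
    gap = abs(baseline - human)
    if gap == 0:
        raise ValueError("baseline and human scores must differ")
    return (result - human) / gap


# Hypothetical agreement scores, for illustration only (not from the paper).
human_score = 0.75
baseline_score = 0.5
model_score = 0.25

print(scale_to_human(human_score, human_score, baseline_score))     # human maps to 0
print(scale_to_human(baseline_score, human_score, baseline_score))  # baseline maps to -1
print(scale_to_human(model_score, human_score, baseline_score))     # below baseline: less than -1
```

On this scale, any value below -1 (such as the -1.52 reported for Point (C)) indicates a result that falls short of the baseline, not just of human agreement.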
