Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Michael Hardy

TL;DR

We show quantitatively that how difficult human experts find the task of scoring children's written work has no observed statistical effect on LLM performance, and that some scoring tasks measured as the easiest for human scorers were the hardest for LLMs.

Abstract

Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects meta-regression. We show quantitatively that how difficult human experts find the task of scoring children's written work has no observed statistical effect on LLM performance. In particular, some scoring tasks measured as the easiest for human scorers were the hardest for LLMs. Whether through poor implementation by thoughtful researchers or through patterns traceable to autoregressive training, decoder-only architectures underperform encoders by 0.37 on average, a substantial difference in agreement with humans. Additionally, we measure how various aspects of LLM technology contribute to successful scoring, such as tokenizer vocabulary size, which exhibits diminishing returns, potentially due to undertrained tokens. The findings argue for systems design that better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
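
For readers unfamiliar with the effect size named in the abstract, the following is a minimal sketch of the standard Quadratic Weighted Kappa computation between two sets of integer scores. It is not taken from the paper's code; the function name `quadratic_weighted_kappa`, its parameters, and the toy score lists are assumptions made for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and equal in length")

    n = len(rater_a)
    k = max_score - min_score + 1          # number of score categories

    observed = Counter(zip(rater_a, rater_b))  # joint counts O[i, j]
    hist_a = Counter(rater_a)                  # marginal counts, rater A
    hist_b = Counter(rater_b)                  # marginal counts, rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            w = ((i - j) ** 2) / ((k - 1) ** 2) if k > 1 else 0.0
            # Expected count if the two raters were independent.
            e = hist_a[i] * hist_b[j] / n
            weighted_observed += w * observed[(i, j)]
            weighted_expected += w * e

    if weighted_expected == 0.0:
        # Both raters gave one identical constant score; treat as agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy scores (made up for illustration, not data from the study).
human = [0, 0, 1, 1]
model = [0, 1, 0, 1]
print(quadratic_weighted_kappa(human, model, 0, 1))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. This is the scale on which the 0.37 difference between architectures is reported.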

Paper Structure (53 sections, 2 equations, 3 figures, 10 tables)


Figures (3)

  • Figure 1: LM benchmarks over time, from kiela_plotting_2023. Short-answer scoring results for K-12 student writing on the ASAP-SAS dataset, in purple, are added using the same scaling: human performance is set to 0 and baseline performance to -1, and each result $X$ is scaled as $(X-\text{Human})/|\text{Baseline}-\text{Human}|$. Human and 2012 baseline results are from shermis_contrasting_2015. Point (A) represents the best LLM model, a transformer ensemble, ormerod_short-answer_2022. Point (B) represents the best GPT model, a fine-tuned ensemble, ormerod_automated_2024. Point (C) [not plotted] is the best implementation using GPT-4 with prompt engineering, jiang_short_2024, and would have a value of -1.52 on this scale. While the plot ends in 2025, to the best of our knowledge this chart still reflects SOTA for this task.
  • Figure 2: Khanmigo IEP Assistant Entry Page
  • Figure 3: QWK distribution of LLMs on ASAP-SAS dataset for meta-analysis models relative to human performance
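
The scaling described in the Figure 1 caption can be sketched as below. The function name `scale_to_human` and the numeric inputs are hypothetical placeholders chosen only to show the arithmetic; they are not the human, baseline, or model values used in the paper.

```python
def scale_to_human(result, human, baseline):
    """Rescale a score so that human performance maps to 0 and a baseline
    below human performance maps to -1, following
    (X - Human) / |Baseline - Human|.

    Hypothetical helper for illustration only.
    """
    gap = abs(baseline - human)
    if gap == 0:
        raise ValueError("baseline and human scores must differ")
    return (result - human) / gap


# Placeholder values (NOT taken from the paper).
human_qwk = 0.90
baseline_qwk = 0.70

print(scale_to_human(human_qwk, human_qwk, baseline_qwk))     # human level
print(scale_to_human(baseline_qwk, human_qwk, baseline_qwk))  # baseline level
print(scale_to_human(0.60, human_qwk, baseline_qwk))          # below baseline
```

On this scale, any value below -1 means the system agrees with human scorers less than the baseline did, which is how the caption's -1.52 for Point (C) should be read.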