AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity
Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang
TL;DR
The paper investigates how AI-generated essays affect automated scoring and academic integrity in GRE Analytical Writing by conducting a large-scale, cross-model study with 10 LLMs and human benchmarks. It analyzes essay characteristics, scoring alignment between human raters and e-rater, and the detectability of AI-generated texts using language and perplexity features, in both within-model and cross-model scenarios. Findings show that AI-generated essays often receive higher e-rater scores than human-written essays, and that detection methods generalize across models to a notable extent, though cross-model transfer remains imperfect. The work highlights the need to expand scoring features to capture deeper reasoning and to develop robust detection strategies as AI-assisted writing becomes pervasive in educational contexts.
Abstract
The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems, such as e-rater, when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including developing new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated by one model can often identify texts from other models with high accuracy, suggesting that effective detection could remain manageable in practice.
