AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity

Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang

TL;DR

The paper investigates how AI-generated essays affect automated scoring and academic integrity in GRE Analytical Writing through a large-scale, cross-model study covering 10 LLMs and human benchmarks. It analyzes essay characteristics, the alignment between human raters' scores and e-rater scores, and the detectability of AI-generated texts using language and perplexity features, in both within-model and cross-model scenarios. AI-generated essays often receive higher e-rater scores than human-written essays, and detectors generalize across models to a notable extent, although cross-model transfer remains imperfect. The work highlights the need to expand scoring features to capture deeper reasoning and to develop robust detection strategies as AI-assisted writing becomes pervasive in educational contexts.

Abstract

The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems, such as e-rater, when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.

Paper Structure

This paper contains 18 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Distribution of the length of essays generated by different LLMs.
  • Figure 2: Semantic similarity between pairs of essays generated by the same LLM and prompt.
  • Figure 3: Verbatim similarity between pairs of essays generated by the same LLM and prompt.
  • Figure 4: Comparison of the language features of the essays from different LLMs and humans. Blue denotes 2024 models and green denotes 2023 models. The original values of each feature across the LLMs have been rescaled using MinMax scaling, which maps each value to the range 0 to 1 based on the minimum and maximum observed values: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$. After scaling, higher values indicate better essay quality in that dimension.
  • Figure 5: Boxplot of the essay perplexity distribution for different models. Blue denotes essays generated by 2024 LLMs and green denotes essays generated by 2023 LLMs.
  • ...and 2 more figures
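
The MinMax scaling described in the Figure 4 caption can be illustrated with a minimal sketch. The function name `min_max_scale` and the sample values are hypothetical and are not taken from the paper; the handling of a constant feature (all values equal) is also an assumption, since the caption does not say what happens in that case.

```python
from typing import Sequence


def min_max_scale(values: Sequence[float]) -> list[float]:
    """Rescale values to [0, 1] via x' = (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no ranking information,
        # so map every value to 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw values of one language feature across four models.
raw_feature = [2.0, 4.0, 6.0, 10.0]
print(min_max_scale(raw_feature))  # [0.0, 0.25, 0.5, 1.0]
```

Scaling each feature separately puts all features on a common 0 to 1 axis so they can be compared in one plot. It does not change the direction of a feature, so the reading "higher is better" holds only if each raw feature was already oriented that way.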