FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing
Jingheng Ye, Shen Wang, Jiaqi Chen, Hebin Wang, Deqing Zou, Yanyu Zhu, Jiwei Tang, Hai-Tao Zheng, Ruitong Liu, Haoyang Li, Yanfeng Wang, Qingsong Wen
TL;DR
FEANEL introduces a fine-grained error-analysis benchmark for K-12 English writing, pairing 1,000 expert-annotated essays with a POS-driven error taxonomy to evaluate LLMs’ ability to classify errors, rate severity, and provide pedagogical explanations. The authors define the per-edit analysis problem, build a rigorous dataset via data collection, cleaning, and annotation, and propose comprehensive evaluation metrics that go beyond holistic scoring. Through extensive experiments across many models and prompt settings, FEANEL reveals substantial gaps in current LLMs’ fine-grained pedagogical feedback, with performance heavily influenced by prompt design, model scale, and reasoning strategies. Human teachers still outperform AI on this task, highlighting the need for better alignment and educationally grounded feedback mechanisms in LLMs. FEANEL thus offers a foundational benchmark to push toward more interpretable and effective AI-assisted language learning tools.
Abstract
Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners, and we present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, together with a well-developed English writing error taxonomy. Each error is annotated by language education experts with its type, its severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for methods tailored to educational applications.
