Can GPT-4 do L2 analytic assessment?

Stefano Bannò, Hari Krishna Vydana, Kate M. Knill, Mark J. F. Gales

TL;DR

This study investigates whether GPT-4 can infer analytic CEFR-based sub-scores from holistic L2 writing assessments, using a zero-shot approach. A Longformer holistic grader provides baseline scores, which GPT-4 then uses to generate nine analytic aspect scores; these are evaluated by correlating them with linguistic and discourse features. Despite the absence of ground-truth analytic scores, the experiments reveal significant correlations between the GPT-4-derived analytic scores and the expected componential constructs, especially when the holistic inputs are reliable. The work suggests a promising avenue for deriving detailed, actionable feedback from existing holistic scores, while highlighting limitations and directions for validating analytic constructs in language testing.

Abstract

Automated essay scoring (AES) for evaluating second language (L2) proficiency has been an established technology in educational contexts for decades. Although holistic AES has advanced to the point of matching or even exceeding human performance, analytic scoring still encounters issues because it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference (CEFR), aiming to extract detailed information about the analytic components underlying those scores. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
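
The correlation analysis mentioned above can be illustrated with a minimal sketch. It assumes a Spearman rank correlation, which is one common choice for ordinal scores; the abstract does not say which coefficient the authors use. The variable names and all numbers below are hypothetical and serve only to show the computation for a single aspect (grammatical accuracy) against a single feature (error rate).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one predicted analytic score per essay (made-up scale)
# and a made-up grammatical error rate for the same essays.
grammar_scores = [2.0, 3.5, 3.0, 4.5, 5.0, 4.0]
errors_per_100_words = [9.1, 6.0, 7.2, 3.3, 1.8, 4.1]

rho = spearman(grammar_scores, errors_per_100_words)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative coefficient here would mean that essays given higher predicted grammar scores tend to contain fewer errors, which is the kind of relationship between an analytic score and its associated feature that the study looks for.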

Paper Structure (21 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: The pipeline presented in this study. Grammatical accuracy is only one of the aspects considered.
  • Figure 2: Results of the post-hoc Nemenyi test.