Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Yupei Wang; Renfen Hu; Zhe Zhao

TL;DR

The proposed method, using counterfactual intervention assisted by Large Language Models, reveals that BERT-like models primarily focus on sentence-level features, whereas LLMs such as GPT-3.5, GPT-4 and Llama-3 are sensitive to conventions & accuracy, language complexity, and organization, indicating a more comprehensive rationale alignment with scoring rubrics.

Abstract

While current Automated Essay Scoring (AES) methods demonstrate high scoring agreement with human raters, their decision-making mechanisms are not fully understood. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that BERT-like models primarily focus on sentence-level features, whereas LLMs such as GPT-3.5, GPT-4 and Llama-3 are sensitive to conventions & accuracy, language complexity, and organization, indicating a more comprehensive rationale alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions when giving feedback on essays. Our approach improves understanding of neural AES methods and can also be applied to other domains seeking transparency in model-driven decisions.

Paper Structure (45 sections, 5 equations, 4 figures, 14 tables)

Introduction
Related Work
AES based on Neural Language Models
Interpretability and Robustness of AES Models
Counterfactual Analysis
Method
Concepts for Intervention
Measurement of Rationale Alignment
Counterfactual Generation
The Validity of LLM Generated Counterfactuals
Experiments
Settings
Counterfactual Validation Results
Scoring Results
Feedback Analysis
...and 30 more sections

Figures (4)

Figure 1: The pipeline of our proposed method.
Figure 2: Cohen's $\mathcal{D}$ measured for seven linguistic metrics on three interventions.
Figure 3: Scoring performance of GPT-3.5 SFT models with varying amounts of training data. The models' performance improves as the number of training samples increases, reaching levels comparable or equivalent to those of BERT-like models.
Figure 7: The pipeline of our proposed method.
