Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings

Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, Maximilian Sailer

TL;DR

This work assesses the reliability of GPT-4 when providing feedback on macroeconomics responses in higher education, focusing on content and style ratings across iterations, time spans, and stylistic variations. Using a tightly controlled prompt framework with in-context demonstrations and deterministic settings, the study computes intraclass correlation coefficients and the relationship between content and style judgments. Results show very high interrater reliability (ICC 0.94–0.999) and a strong content-style correlation (0.87), with content ratings largely robust to stylistic paraphrasing. The findings support GPT-4’s potential for consistent automated feedback under controlled prompts, while highlighting the need for further research into long-term reliability, transparency, and broader use cases in education.

Abstract

This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans, and stylistic variations. The model rated responses to tasks in the Higher Education (HE) subject domain of macroeconomics in terms of their content and style. Statistical analysis was conducted to assess interrater reliability, the consistency of ratings across iterations, and the correlation between content and style ratings. The results revealed high interrater reliability, with ICC scores ranging between 0.94 and 0.99 for different time spans, suggesting that GPT-4 is capable of generating consistent ratings across repetitions when given a clear prompt. Style and content ratings show a high correlation of 0.87. When an inadequate style was applied, the average content ratings remained constant while style ratings decreased, which indicates that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is also presented and explained. Further research is necessary to assess the robustness and reliability of AI models in various use cases.
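
As an illustration of the reliability statistic named in the abstract, the following is a minimal sketch of one common intraclass correlation variant, ICC(2,1) (two-way random effects, absolute agreement, single rating). The abstract does not say which ICC form the authors used, so the choice of variant, the function name `icc2_1`, and the example data are assumptions made for illustration only. Each row is one rated response and each column is one repeated GPT-4 rating run, treated here as a "rater".

```python
def icc2_1(ratings):
    """ICC(2,1) for a table of ratings: rows = responses, columns = rating runs.

    Assumed variant (two-way random effects, absolute agreement, single rating);
    not necessarily the form used in the study.
    """
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # number of rating runs per response
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical data: four responses, each rated identically in three runs.
identical_runs = [[7, 7, 7], [4, 4, 4], [9, 9, 9], [5, 5, 5]]
print(round(icc2_1(identical_runs), 3))
```

When every run assigns the same score to a given response, the error and between-run variance are zero and the coefficient reaches its maximum of 1; disagreement between runs lowers it.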

Paper Structure (17 sections, 1 figure, 5 tables)