Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi

TL;DR

This paper addresses the accessibility divide in large language models through a multi-faceted comparison of nine LLMs on automated essay scoring (AES) tasks covering both assessment and generation. Using the ASAP and FCE datasets, the authors evaluate zero-shot and few-shot performance, fairness via three-way ANOVA, cost, and embedding-based assessments of generated text. They find that open models such as Qwen2.5 and Llama 3 perform comparably to GPT-4 in few-shot essay scoring with comparable fairness profiles, and that Llama 3 is up to 37 times more cost-efficient than GPT-4; open-source models also show growing viability. The results challenge the dominance of closed LLMs and support broader adoption of open ecosystems to democratize access to advanced NLP capabilities while preserving competitive performance and fairness.

Abstract

Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption have sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to those generated by closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.

Paper Structure

This paper contains 22 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Analysis Framework
  • Figure 2: Few-shot Results Comparing GPT-4 and Qwen2.5 Across Prompt Types.
  • Figure 3: Few-shot Results Comparing $\Delta$ Scores (Human - LLM prediction) Across Assessment Models and Prompt Types. (left) Differences by Race, (right) Differences by Age
  • Figure 4: t-SNE plot of Human and LLM Generated Essays
  • Figure 5: (left) Comparing Scores of Different LLM Assessors for LLMs/Human Generated Text, (right) Interaction Effect Between Respondent and Prompt. Blue Lines Denote Closed LLMs, Orange Denote Open LLMs
  • ...and 3 more figures