Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Rudra Jadhav, Janhavi Danve, Sonalika Shaw

Abstract

As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and a C+ letter grade. Non-native phrasing was penalized by 1.35 points (LLaMA) and 0.90 points (Qwen). In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persistent despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
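
The abstract reports two quantities per condition: a mean score penalty and a Cohen's d effect size. The sketch below shows one way such numbers could be computed from paired scores (the same response graded before and after a style perturbation). It is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the paired-differences form of Cohen's d is assumed because the abstract does not say which variant was used, and the scores are invented.

```python
from statistics import mean, stdev


def mean_penalty(original, perturbed):
    """Average score drop from the original to the perturbed response."""
    return mean(o - p for o, p in zip(original, perturbed))


def cohens_d_paired(original, perturbed):
    """Cohen's d for paired samples (an assumed variant):
    mean of the score differences divided by their sample standard deviation."""
    diffs = [o - p for o, p in zip(original, perturbed)]
    return mean(diffs) / stdev(diffs)


# Hypothetical 1-10 scores, invented for illustration only (not data from the paper).
original_scores = [9, 8, 9, 7, 8]
perturbed_scores = [7, 7, 8, 5, 7]

penalty = mean_penalty(original_scores, perturbed_scores)
effect = cohens_d_paired(original_scores, perturbed_scores)
print(f"mean penalty: {penalty:.2f} points, Cohen's d: {effect:.2f}")
```

A positive penalty means the perturbed version scored lower even though its content is unchanged, which is the pattern the abstract describes for Essay/Writing tasks.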

Paper Structure (24 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Grading bias by perturbation type for Essay/Writing tasks. Both models show statistically significant penalties ($* = p < 0.05$) across all three perturbation types. Informal language consistently receives the largest penalty, followed by non-native phrasing and grammar errors.
  • Figure 2: Bias heatmap for Essay/Writing. Numerical values represent mean score penalties. LLaMA exhibits uniformly higher penalties than Qwen, with the LLaMA--Informal cell showing the maximum observed bias of 1.90 points.
  • Figure 3: Bias comparison across subjects for both models. Essay/Writing bias dramatically exceeds Math and Programming, which show near-zero penalties. This "subjectivity gradient" is consistent across both models.
  • Figure 4: Cohen's $d$ effect sizes across all 18 experimental conditions. Diamonds indicate statistically significant results ($p < 0.05$); circles indicate non-significant results. Vertical reference lines mark the conventional thresholds for small (0.2), medium (0.5), and large (0.8) effects. Essay/Writing conditions cluster far into the large-effect region.
  • Figure 5: Human ground-truth vs. Qwen 2.5 72B scores across subjects. In the Essay panel, blue dots (informal language) and green dots (non-native phrasing) consistently fall below the perfect agreement diagonal, directly visualizing the systematic score penalty.
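
The conventional thresholds named in the Figure 4 caption (0.2, 0.5, 0.8) can be applied mechanically to an effect size. The helper below is a minimal sketch, not part of the paper: the function name and the "negligible" label for values under 0.2 are assumptions, and it uses only the three thresholds listed in the caption, so it has no separate "very large" category.

```python
def effect_size_label(d):
    """Label a Cohen's d value using the conventional 0.2 / 0.5 / 0.8 thresholds."""
    magnitude = abs(d)  # the sign only indicates the direction of the difference
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # assumed label for values below the "small" threshold


# The two extremes reported in the abstract for Essay/Writing conditions.
for d in (0.64, 4.25):
    print(d, effect_size_label(d))
```

Under these thresholds the smallest reported Essay/Writing effect (d = 0.64) is medium, and the largest (d = 4.25) sits well beyond the large-effect line.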