AI Gender Bias, Disparities, and Fairness: Does Training Data Matter?

Ehsan Latif, Xiaoming Zhai, Lei Liu

TL;DR

This study tackles gender bias in AI-based automatic scoring of student-written responses by fine-tuning BERT and GPT-3.5 on mixed-gender and gender-specific datasets across six tasks. Using paired t-tests, mean score gap (MSG), and Equalized Odds (EO), the authors show that mixed training yields no significant scoring bias and generally reduces gender disparity and improves fairness compared with gender-specific training. MSG differences between mixed training and human benchmarks are small and statistically insignificant, while gender-specific models exhibit larger MSG and higher EO, suggesting that unbalanced data can widen gaps even if bias in scoring is not inherent. The findings emphasize that training data composition is a crucial lever for achieving fairer AI assessments in education, with practical implications for deploying AI scoring systems in diverse classrooms.

Abstract

This study examines gender-related issues in artificial intelligence (AI), specifically in automatic scoring systems for student-written responses. The primary objective is to investigate gender bias, disparity, and fairness in AI scoring outcomes when models are trained on mixed-gender samples. Using fine-tuned versions of BERT and GPT-3.5, this research analyzes more than 1000 human-graded student responses from male and female participants across six assessment items. The study employs three distinct techniques: scoring accuracy difference to evaluate bias, mean score gaps by gender (MSG) to evaluate disparity, and Equalized Odds (EO) to evaluate fairness. The results indicate that the scoring accuracy of mixed-trained models does not differ significantly from that of either male- or female-trained models, suggesting no significant scoring bias. Consistently across both BERT and GPT-3.5, we found that mixed-trained models produced smaller MSG and non-disparate predictions compared to humans. In contrast, compared to humans, gender-specifically trained models yielded larger MSG, indicating that unbalanced training data may produce models that enlarge gender disparities. The EO analysis suggests that mixed-trained models generated fairer outcomes than gender-specifically trained models. Collectively, the findings suggest that gender-unbalanced data do not necessarily generate scoring bias but can enlarge gender disparities and reduce scoring fairness.
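
To make the disparity and fairness measures named in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the "M"/"F" labels, the toy data, and the restriction to binary (0/1) scores are all assumptions made for illustration. In particular, Equalized Odds is summarized here as the larger of the true-positive-rate and false-positive-rate differences between the two groups, which is one common formulation; the paper's exact definition may differ.

```python
from statistics import mean


def mean_score_gap(scores, genders):
    """Mean score of male responses minus mean score of female responses."""
    male = [s for s, g in zip(scores, genders) if g == "M"]
    female = [s for s, g in zip(scores, genders) if g == "F"]
    return mean(male) - mean(female)


def _positive_rate(y_true, y_pred, genders, group, actual):
    """Share of responses in `group` with human score `actual` that the model scored 1."""
    preds = [p for t, p, g in zip(y_true, y_pred, genders)
             if g == group and t == actual]
    return sum(preds) / len(preds)


def equalized_odds_gap(y_true, y_pred, genders):
    """Larger of the TPR gap and FPR gap between groups (0 means equalized odds holds)."""
    tpr_gap = abs(_positive_rate(y_true, y_pred, genders, "M", 1)
                  - _positive_rate(y_true, y_pred, genders, "F", 1))
    fpr_gap = abs(_positive_rate(y_true, y_pred, genders, "M", 0)
                  - _positive_rate(y_true, y_pred, genders, "F", 0))
    return max(tpr_gap, fpr_gap)


# Hypothetical toy data: four male and four female responses with binary scores.
genders = ["M", "M", "M", "M", "F", "F", "F", "F"]
human = [1, 1, 0, 0, 1, 1, 0, 0]   # human-assigned scores (treated as ground truth)
model = [1, 1, 0, 1, 1, 0, 0, 0]   # scores from a hypothetical fine-tuned model

print("Human MSG:", mean_score_gap(human, genders))
print("Model MSG:", mean_score_gap(model, genders))
print("EO gap:   ", equalized_odds_gap(human, model, genders))
```

In this toy case the human scores show no gap between groups, while the hypothetical model over-scores one male response and under-scores one female response, so both its MSG and its EO gap are nonzero. Comparing a model's MSG against the human MSG in this way mirrors the kind of comparison the abstract describes.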

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of AI gender bias analysis for automatic scoring.
  • Figure 2: Example Multi-class Task: Falling weights
  • Figure 3: Mean score gaps between male and female testing data, comparing human-graded scores with fine-tuned LLMs' scores: mixed-trained model (left), male-trained model (center), and female-trained model (right), shown as line plots for the BERT (top) and GPT-3.5 (bottom) fine-tuned models. The shaded region in each plot signifies the $\Delta MSG$.