Unveiling Gender Bias in Large Language Models: Using Teacher's Evaluation in Higher Education As an Example

Yuanning Huang

TL;DR

The study investigates gender bias in GPT-4-generated teacher evaluations in higher education by applying a multi-method framework that combines Odds Ratio analysis, the Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis across six subjects. It finds that language associated with female instructors emphasizes approachability and support (communal traits), while language associated with male instructors emphasizes entertainment and agentic descriptors. WEAT links male-salient adjectives with male names, though career and family terms show weaker discriminatory power in this context. Sentiment trends generally favor female instructors, though contextual analysis reveals more nuanced biases, and qualitative word usage in Engineering shows that the same words are interpreted differently by gender. Overall, the results indicate that LLM-generated evaluations reflect and potentially reinforce societal gender biases, underscoring the need for bias auditing and mitigation in AI-assisted educational contexts.

Abstract

This paper investigates gender bias in Large Language Model (LLM)-generated teacher evaluations in higher education settings, focusing on evaluations produced by GPT-4 across six academic subjects. By applying a comprehensive analytical framework that includes Odds Ratio (OR) analysis, the Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis, this paper identifies patterns of gender-associated language that reflect societal stereotypes. Specifically, words related to approachability and support were used more frequently for female instructors, while words related to entertainment were predominantly used for male instructors, aligning with the concepts of communal and agentic behaviors. The study also found moderate to strong associations between male-salient adjectives and male names, though career and family words did not distinctly capture gender biases. These findings align with prior research on societal norms and stereotypes, reinforcing the notion that LLM-generated text reflects existing biases.
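As a rough illustration of the Odds Ratio step named above, the sketch below scores each word by the log of its odds in male-instructor text relative to female-instructor text. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `log_odds_ratios`, the additive smoothing constant, and the toy word lists are all hypothetical, and the paper's exact OR definition and preprocessing may differ.

```python
import math
from collections import Counter


def log_odds_ratios(male_words, female_words, smoothing=0.5):
    """Return {word: log odds ratio}; positive values lean male, negative lean female.

    Assumption: the odds of a word in a corpus are count / (total - count),
    with `smoothing` added to every cell to avoid division by zero.
    """
    male_counts = Counter(male_words)
    female_counts = Counter(female_words)
    male_total = sum(male_counts.values())
    female_total = sum(female_counts.values())

    scores = {}
    for word in set(male_counts) | set(female_counts):
        m_in = male_counts[word] + smoothing
        m_out = male_total - male_counts[word] + smoothing
        f_in = female_counts[word] + smoothing
        f_out = female_total - female_counts[word] + smoothing
        scores[word] = math.log((m_in / m_out) / (f_in / f_out))
    return scores


# Hypothetical toy data, not taken from the paper's generated evaluations.
male_adjectives = ["funny", "funny", "clear"]
female_adjectives = ["caring", "caring", "clear"]
scores = log_odds_ratios(male_adjectives, female_adjectives)
for word, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{word}: {score:+.3f}")
```

Words with large positive or negative scores would be the candidates for "salient" adjectives of each gender; a word used equally often in both groups scores zero.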

Paper Structure (22 sections, 4 equations, 7 figures, 3 tables)

Introduction
Literature Review
Gender Biases in Language
Gender Biases in Machine Learning and LLMs
Definitions of Gender Biases in AI and LLMs
Gender Biases in Performance Reviews and Teacher Evaluations
Data and Methods
Data Generation
Methods
Odds Ratio Analysis
WEAT Score Analysis
Sentiment Analysis
Contextual Analysis
Findings
Odds Ratio Analysis
...and 7 more sections

Figures (7)

Figure 1: Log Transformed Distribution of OR Scores for Salient Adjectives by Subject
Figure 2: Salient adjectives for each gender divided by subject area. Note the OR score here is after log transformation.
Figure 3: Salient adjectives for each gender divided by subject area. Note the OR score here is after log transformation. Blue: words related to approachability and support. Green: words related to entertainment. Red: words related to excellence and distinction.
Figure 4: Salient adjectives for each gender divided by subject area. Note the OR score here is after log transformation. Blue: words related to approachability and support. Green: words related to entertainment. Red: words related to excellence and distinction. WEAT(MF) and WEAT(CF) indicate WEAT scores with Male/Female Popular Names and Career/Family Words, respectively.
Figure 5: Salient adjectives for each gender divided by subject area. Note the OR score here is after log transformation. Blue: words related to approachability and support. Green: words related to entertainment. Red: words related to excellence and distinction. Sentiment Score Male and Sentiment Score Female refer to sentiment scores obtained by TextBlob using male or female salient adjectives, respectively.
...and 2 more figures
