Evaluation of Large Language Models: STEM education and Gender Stereotypes

Smilla Due; Sneha Das; Marianne Andersen; Berta Plandolit López; Sniff Andersen Nexø; Line Clemmensen

TL;DR

This study probes whether ChatGPT reinforces gender stereotypes in children's STEM education choices across English, Danish, Catalan, and Hindi. Using a true-to-user-case design with open-ended prompts that each elicit 10 occupation suggestions and vary the age and gender proxies, the authors analyze 320 data points per language with a two-factor ANOVA on Box-Cox transformed counts. They find significant gender differences in STEM suggestion counts in all four languages, with boys receiving more STEM items, and observe age- and language-specific interaction effects (e.g., Catalan shows a gender–age interaction; English and Danish show stronger age effects). The results suggest that LLMs can propagate gender stereotypes in educational contexts, underscoring the need for bias-aware prompt design, cross-cultural validation, and robust mitigation strategies in educational AI tools.

Abstract

Large Language Models (LLMs) have an increasing impact on our lives, with use cases such as chatbots, study support, coding support, ideation, writing assistance, and more. Previous studies have revealed linguistic biases in the pronouns used to describe professions and in the adjectives used to describe men vs women. These issues have to some degree been addressed in updated LLM versions, at least enough to pass existing tests. However, biases may still be present in the models, and repeated use of gender-stereotypical language may reinforce the underlying assumptions; such biases are therefore important to examine further. This paper investigates gender biases in LLMs in relation to educational choices through an open-ended, true-to-user-case experimental design and a quantitative analysis. We investigate the biases in the context of four different cultures, languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important educational transition points in the different countries. We find significant and large differences in the ratio of STEM to non-STEM education paths suggested by ChatGPT when typical girl vs boy names are used to prompt lists of suggested things to become. There are generally fewer STEM suggestions in the Danish, Spanish, and Indian contexts than in the English one. We also find subtle differences in the suggested professions, which we categorise and report.
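
The analysis summarised in the TL;DR (a two-factor ANOVA on Box-Cox transformed STEM counts, with gender and age as factors) can be illustrated with a minimal sketch. This is not the authors' code. The toy counts are invented for illustration, the fixed exponent `LAMBDA = 0.5` and the `+ 1` shift that keeps zero counts positive are assumptions, and the helper names `box_cox` and `two_way_anova` are hypothetical. In practice the exponent is usually estimated by maximum likelihood, for example with `scipy.stats.boxcox`, and p-values are read from the F distribution, which this sketch omits.

```python
import math
from itertools import product


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def mean(values):
    return sum(values) / len(values)


def two_way_anova(cells):
    """Balanced two-factor ANOVA with interaction.

    `cells` maps (level_a, level_b) -> list of observations, with the same
    number (at least 2) of observations in every cell. Returns, per effect,
    a tuple (sum of squares, degrees of freedom, F statistic).
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch only handles balanced designs")

    grand = mean([y for v in cells.values() for y in v])
    mean_a = {a: mean([y for b in levels_b for y in cells[(a, b)]]) for a in levels_a}
    mean_b = {b: mean([y for a in levels_a for y in cells[(a, b)]]) for b in levels_b}

    # Partition the total variation into main effects, interaction, and error.
    ss_a = n * len(levels_b) * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = n * len(levels_a) * sum((m - grand) ** 2 for m in mean_b.values())
    ss_ab = n * sum(
        (mean(cells[(a, b)]) - mean_a[a] - mean_b[b] + grand) ** 2
        for a, b in product(levels_a, levels_b)
    )
    ss_err = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab = df_a * df_b
    df_err = len(levels_a) * len(levels_b) * (n - 1)
    ms_err = ss_err / df_err
    return {
        "A": (ss_a, df_a, ss_a / df_a / ms_err),
        "B": (ss_b, df_b, ss_b / df_b / ms_err),
        "AxB": (ss_ab, df_ab, ss_ab / df_ab / ms_err),
        "error": (ss_err, df_err, float("nan")),
    }


# Invented toy data: number of STEM items among 10 suggestions per prompt,
# keyed by (gender proxy, age). These are NOT results from the paper.
toy_counts = {
    ("boy", 10): [4, 5, 3, 4],
    ("boy", 16): [5, 6, 5, 4],
    ("girl", 10): [2, 3, 2, 3],
    ("girl", 16): [3, 2, 4, 3],
}

LAMBDA = 0.5  # assumed exponent; normally estimated from the data
transformed = {
    key: [box_cox(count + 1, LAMBDA) for count in counts]  # +1 shift is an assumption
    for key, counts in toy_counts.items()
}

# Factor A is the gender proxy, factor B is age.
for effect, (ss, df, f_stat) in two_way_anova(transformed).items():
    print(f"{effect}: SS={ss:.3f}, df={df}, F={f_stat:.2f}")
```

The hand-rolled sums of squares assume a balanced design with equal cell sizes. An unbalanced design would call for a regression-based ANOVA from a statistics library instead.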
