Detecting Gender Stereotypes in Scratch Programming Tutorials
Isabella Graßl, Benedikt Fein, Gordon Fraser
TL;DR
This paper tackles the persistence of gender stereotypes in Scratch programming tutorials by developing a dedicated framework to identify 'gender stereotype smells' across characters, content, instructions, and programming concepts. It builds an automated toolchain to evaluate 73 real tutorials and 16 LLM-generated projects, revealing that about one-fifth contain stereotype smells and that current LLMs struggle to detect them without structured guidance. While LLMs show potential to aid in generating more inclusive materials, their bias-detection performance is inconsistent, and the materials they generate often contain nuanced stereotypes that are harder for educators to notice. The work offers actionable guidance for teachers to assess teaching content and highlights avenues for refining LLM-based generation and evaluation to foster more inclusive computing education.
Abstract
Gender stereotypes in introductory programming courses often go unnoticed, yet they can negatively influence young learners' interest and learning, particularly among under-represented groups such as girls. Popular tutorials on block-based programming with Scratch may unintentionally reinforce biases through character choices, narrative framing, or activity types. Educators currently lack support in identifying and addressing such bias. With large language models (LLMs) increasingly used to generate teaching materials, this problem is potentially exacerbated by LLMs trained on biased datasets. However, LLMs also offer an opportunity to address this issue. In this paper, we explore the use of LLMs for automatically identifying gender-stereotypical elements in Scratch tutorials, thus offering feedback on how to improve teaching content. We develop a framework for assessing gender bias considering characters, content, instructions, and programming concepts. Analogous to how code analysis tools provide feedback on code in terms of code smells, we operationalise this framework using an automated tool chain that identifies *gender stereotype smells*. Evaluation on 73 popular Scratch tutorials from leading educational platforms demonstrates that stereotype smells are common in practice. LLMs are not effective at detecting them, but our gender bias evaluation framework can guide LLMs in generating tutorials with fewer stereotype smells.
