
How Prevalent is Gender Bias in ChatGPT? -- Exploring German and English ChatGPT Responses

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

TL;DR

The paper investigates how gender bias manifests in English and German ChatGPT responses from the perspective of a non-IT user working on university communications. It combines an exploration phase of broad prompts with a focused exploitation phase that uses two prompts to quantify language use, gender-coded terms, and text length, while tracking grammatical quality and system behavior under unannounced updates. Key findings show that English responses are generally textually sound, that German responses suffer from subtle grammatical issues, and that gendered prompts can trigger a gender-diversity template that may bias content. The exploitation phase reveals a tendency toward female personas and STEM fields, with notable differences between the two languages. The study emphasizes that while ChatGPT can be a helpful drafting tool, non-IT users must carefully proofread outputs and anticipate model updates that can alter results, underscoring the need for broader, cross-model bias analysis and tooling to mitigate discriminatory or unguided content.

Abstract

With the introduction of ChatGPT, OpenAI made large language models (LLMs) accessible to users with limited IT expertise. However, users with no background in natural language processing (NLP) might lack a proper understanding of LLMs and thus an awareness of their inherent limitations, and may therefore take the system's output at face value. In this paper, we systematically analyse prompts and the generated responses to identify possible problematic issues, with a special focus on gender biases, which users need to be aware of when processing the system's output. We explore how ChatGPT reacts in English and German if prompted to answer from a female, male, or neutral perspective. In an in-depth investigation, we examine selected prompts and analyse to what extent responses differ if the system is prompted several times in an identical way. On this basis, we show that ChatGPT is indeed useful for helping non-IT users draft texts for their daily work. However, it is absolutely crucial to thoroughly check the system's responses for biases as well as for syntactic and grammatical mistakes.

Paper Structure

This paper contains 27 sections and 4 figures.

Figures (4)

  • Figure 1: Female-coded words used on average in all perspectives of English responses (a) and German responses (b) for the prompt about a professor who won a prize. The number of usages is averaged over all responses of a perspective.
  • Figure 2: Male-coded words used on average in all perspectives of English responses (a) and German responses (b) for the prompt about a professor who won a prize. The number of usages is averaged over all responses of a perspective.
  • Figure 3: Female-coded words used on average in all perspectives of English responses (a) and German responses (b) for the prompt about the characteristics of a good professor. The number of usages is averaged over all responses of a perspective.
  • Figure 4: Male-coded words used on average in all perspectives of English responses (a) and German responses (b) for the prompt about the characteristics of a good professor. The number of usages is averaged over all responses of a perspective.
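
The figure captions describe counting gender-coded words and averaging the counts over all responses of a perspective. The following is a minimal sketch of what such a computation might look like, not the authors' implementation. The word lists, the function names, and the example responses are hypothetical placeholders; the lexicon actually used in the paper is not given in this section.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; NOT the lexicon used in the paper.
FEMALE_CODED = {"supportive", "collaborative", "empathetic"}
MALE_CODED = {"ambitious", "competitive", "decisive"}


def count_coded_words(text, lexicon):
    """Count how often each lexicon word occurs in one response (case-insensitive)."""
    # [^\W\d_]+ matches runs of Unicode letters, so German umlauts and ß are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if token in lexicon)


def average_usage_per_perspective(responses, lexicon):
    """Average per-word counts over all responses of each perspective.

    `responses` maps a perspective label (e.g. "female", "male", "neutral")
    to a list of response texts for that perspective.
    """
    averages = {}
    for perspective, texts in responses.items():
        totals = Counter()
        for text in texts:
            totals.update(count_coded_words(text, lexicon))
        n = len(texts)
        # Divide the summed counts by the number of responses of this perspective.
        averages[perspective] = (
            {word: totals[word] / n for word in sorted(totals)} if n else {}
        )
    return averages


# Made-up example responses, not data from the study.
example = {
    "female": ["She is supportive and collaborative.", "A supportive mentor."],
    "male": ["He is ambitious and decisive."],
}
result = average_usage_per_perspective(example, FEMALE_CODED)
```

Dividing by the number of responses per perspective, rather than by the total number of words, mirrors the captions' wording ("averaged over all responses of a perspective"). Normalising by text length instead would be a different design choice.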