A Multilingual Perspective on Probing Gender Bias

Karolina Stańczak

TL;DR

This work tackles gender bias in multilingual NLP by integrating gender theories, linguistic phenomena, and probing techniques across languages. It introduces dataset collections (social media, historical newspapers, ambassadors on Twitter) and novel probing methods (latent-variable intrinsic probes; template-based and identity-diverse datasets) to examine linguistic and societal biases encoded in language and language models. The studies reveal multilingual and cross-linguistic patterns, including causal analyses challenging the notion that grammatical gender strongly drives adjective choice, and highlight that bias manifests in nuanced ways across visibility, language, and sentiment rather than solely through negativity. Collectively, the contributions advance multidisciplinary bias analysis, provide benchmarks for multilingual bias detection, and offer methodological guidance for probing both natural language and language models in diverse linguistic contexts.

Abstract

Gender bias represents a form of systematic negative treatment that targets individuals based on their gender. This discrimination can range from subtle sexist remarks and gendered stereotypes to outright hate speech. Prior research has revealed that ignoring online abuse not only affects the individuals targeted but also has broader societal implications. These consequences extend to the discouragement of women's engagement and visibility within public spheres, thereby reinforcing gender inequality. This thesis investigates the nuances of how gender bias is expressed through language and within language technologies. Significantly, this thesis expands research on gender bias to multilingual contexts, emphasising the importance of a multilingual and multicultural perspective in understanding societal biases. In this thesis, I adopt an interdisciplinary approach, bridging natural language processing with other disciplines such as political science and history, to probe gender bias in natural language and language models.

Paper Structure (335 sections, 2 theorems, 65 equations, 73 figures, 83 tables)

Key Result

Proposition 1

Let $A$ and $G$ be $\mathcal{A}$-valued and $\mathcal{G}$-valued random variables, respectively. Further assume they are jointly distributed according to $p(a \mid \mathrm{do}(G = \ldots))$ … where $\mathrm{MI}_{\mathrm{do}}(A; G)$ is the mutual information …
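
The proposition's statement is cut off in this summary, so the following is only a minimal sketch of the kind of quantity it names, not the thesis's own definition or estimator. It assumes discrete $A$ and $G$, a hypothetical prior `p_g` over gender values, and a hypothetical table `p_a_given_do_g` whose rows hold $p(a \mid \mathrm{do}(G = g))$; under those assumptions it computes the standard mutual information of the resulting joint distribution, in nats. The function name `interventional_mi` is illustrative.

```python
import math


def interventional_mi(p_g, p_a_given_do_g):
    """Mutual information (nats) between A and G under an assumed joint
    p(g) * p(a | do(G = g)). Both inputs are hypothetical illustrations."""
    n_a = len(p_a_given_do_g[0])
    # Marginal over A obtained by averaging the interventional rows with p(g).
    p_a = [
        sum(p_g[g] * p_a_given_do_g[g][a] for g in range(len(p_g)))
        for a in range(n_a)
    ]
    mi = 0.0
    for g, pg in enumerate(p_g):
        for a, pag in enumerate(p_a_given_do_g[g]):
            # Zero-probability cells contribute nothing (0 * log 0 := 0).
            if pg > 0 and pag > 0:
                mi += pg * pag * math.log(pag / p_a[a])
    return mi


# Toy inputs: intervening on G leaves the distribution over A unchanged.
no_effect = interventional_mi([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]])

# Toy inputs: intervening on G fully determines A.
full_effect = interventional_mi([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

When every row of the table is identical, the result is zero, which corresponds to the intervention on $G$ having no influence on $A$; when each row puts all its mass on a different value of $A$, the result reaches its maximum for a uniform two-valued prior, $\log 2$.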

Figures (73)

  • Figure 1: Cumulative number of papers published on gender bias prior to June 2021.
  • Figure 1: PMI analysis of our historical corpora. Words are placed on the intersectional gender/race plane.
  • Figure 2: An expansion of the use of naming conventions for politicians across the partisan divide of the data (see along the y-axis).
  • Figure 2: An example of a scanned newspaper (a) and the output of the OCR tool Tesseract (b). We fix simple OCR errors (highlighted) using a rule-based approach.
  • Figure 3: Number of ambassadors on Twitter by country of origin.
  • ...and 68 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 1