Are Female Carpenters like Blue Bananas? A Corpus Investigation of Occupation Gender Typicality

Da Ju, Karen Ulrich, Adina Williams

TL;DR

The study asks whether gender typicality in occupations manifests as a language phenomenon analogous to color stereotypy in bananas. Using information-theoretic metrics on large corpora (Pushshift Reddit, Wikipedia) and grounding in US labor statistics, it shows that gender mentioning correlates with the femaleness of an occupation ($r$ values around $0.49$–$0.50$) rather than with a surprisal-based blue-bananas effect. The strongest and most consistent signals come from Reddit data, with weaker effects in Wikipedia and in LM-generated Wikipedia content, suggesting gender salience is higher for woman-dominated occupations. The results support a nuanced view: gender mentioning in text reflects occupation genderedness and perceived salience rather than generic surprise, with notable qualitative patterns (e.g., more discussions around gender-balance in woman-dominated roles). These findings have implications for understanding linguistic gender biases and for evaluating corpora used to train language models.

Abstract

People tend to use language to mention surprising properties of events: for example, when a banana is blue, we are more likely to mention color than when it is yellow. This fact is taken to suggest that yellowness is somehow a typical feature of bananas, and blueness is exceptional. Similar to how a yellow color is typical of bananas, there may also be genders that are typical of occupations. In this work, we explore this question using information-theoretic techniques coupled with corpus statistical analysis. In two distinct large corpora, we do not find strong evidence that occupations and gender display the same patterns of mentioning as do bananas and color. Instead, we find that gender mentioning is correlated with femaleness of occupation in particular, suggesting perhaps that woman-dominated occupations are seen as somehow "more gendered" than male-dominated ones, and thereby they encourage more gender mentioning overall.

Paper Structure (43 sections, 3 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: We found the strongest correlation between the femaleness of an occupation (according to US labor statistics) and gender mentioning in Pushshift.io Reddit, a surprising finding to some extent, because it contradicts the idea that gender mentioning occurs when special events are being pointed out. Instead, this finding points more to a gender-specific phenomenon.
  • Figure 2: We calculate the conditional mutual information between gender and its mentioning by means of a gender indicator $MI(G;M|J=j)$. Very low mutual information indicates that variables are not correlated. We see a wide spread in MI across occupations $j$. However, we see similar occupations in the top spots for Wikipedia and Pushshift.io Reddit.
  • Figure 3: We found correlations between the femaleness of an occupation (according to US labor statistics) and (a) gender, (b) femaleness, (c) maleness mentioning in Pushshift.io Reddit.
  • Figure 4: Ablation: We tested the correlation of occupation genderedness $1 - H(G|J=j)$ and gender mentioning. High occupation genderedness implies either a man- or woman-dominated occupation according to US labor statistics. Observed correlations are weak, eliminating the hypothesis that gender mention is a result of surprise.
  • Figure 5: Corpus comparison: The femaleness of occupation is most strongly correlated with gender mentioning in Pushshift.io Reddit. In Wikipedia, the effect is smaller, and interestingly, it keeps diminishing for Llama 2 Wikipedia.
  • ...and 3 more figures
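
The two information-theoretic quantities named in the captions for Figures 2 and 4 can be illustrated with a small sketch. It is not the authors' implementation: the function names, the binary treatment of gender, the use of base-2 logarithms, and the plug-in estimation from raw counts are all assumptions made here for illustration. `genderedness` computes $1 - H(G|J=j)$ from the share of women in an occupation, and `mutual_information_bits` computes $MI(G;M|J=j)$ from a hypothetical table of co-occurrence counts for a single occupation $j$.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def genderedness(p_female):
    """1 - H(G | J=j), assuming a binary gender variable with
    P(G=female | J=j) = p_female (e.g. a labor-statistics share)."""
    return 1.0 - entropy_bits([p_female, 1.0 - p_female])


def mutual_information_bits(joint_counts):
    """Plug-in estimate of MI(G; M | J=j) for one occupation j.

    joint_counts[g][m] is a hypothetical count of contexts with
    gender value g and mention value m, all restricted to occupation j.
    """
    total = sum(sum(row) for row in joint_counts)
    row_totals = [sum(row) for row in joint_counts]
    col_totals = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for g, row in enumerate(joint_counts):
        for m, n in enumerate(row):
            if n > 0:
                # p(g,m) * log2( p(g,m) / (p(g) * p(m)) ), written with counts
                mi += (n / total) * math.log2(n * total / (row_totals[g] * col_totals[m]))
    return mi
```

Under these assumptions, an evenly split occupation has genderedness 0 and a fully man- or woman-dominated one has genderedness 1, which is why this quantity alone cannot distinguish the direction of the imbalance. Likewise, a count table in which mention is independent of gender gives a mutual information of 0.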