Are Female Carpenters like Blue Bananas? A Corpus Investigation of Occupation Gender Typicality
Da Ju, Karen Ulrich, Adina Williams
TL;DR
The study asks whether gender typicality in occupations shows up in language the way color typicality does for bananas, where an atypical (blue) banana prompts more color mentions than a typical (yellow) one. Using information-theoretic metrics on large corpora (Pushshift Reddit, Wikipedia), grounded in US labor statistics, it finds that gender mentioning correlates with the femaleness of an occupation ($r$ values around $0.49$–$0.50$) rather than following a surprisal-based blue-bananas pattern. The strongest and most consistent signals come from Reddit data, with weaker effects in Wikipedia and in LM-generated Wikipedia content, suggesting that gender is more salient for woman-dominated occupations. The results support a nuanced view: gender mentioning in text reflects an occupation's genderedness and the perceived salience of gender rather than generic surprise, with notable qualitative patterns (e.g., more discussion of gender balance in woman-dominated roles). These findings have implications for understanding linguistic gender biases and for evaluating corpora used to train language models.
Abstract
People tend to use language to mention surprising properties of events: for example, when a banana is blue, we are more likely to mention its color than when it is yellow. This fact is taken to suggest that yellowness is a typical feature of bananas and that blueness is exceptional. Just as yellow is the typical color of bananas, there may also be genders that are typical of occupations. In this work, we explore this question using information-theoretic techniques coupled with corpus statistical analysis. In two distinct large corpora, we do not find strong evidence that occupations and gender display the same patterns of mentioning as bananas and color do. Instead, we find that gender mentioning is correlated with the femaleness of an occupation in particular, suggesting perhaps that woman-dominated occupations are seen as somehow "more gendered" than male-dominated ones, and thereby encourage more gender mentioning overall.
