Impoverished Language Technology: The Lack of (Social) Class in NLP
Amanda Cercas Curry, Zeerak Talat, Dirk Hovy
TL;DR
The paper examines the neglect of socio-economic class in NLP and argues for treating it as a critical socio-demographic variable. It surveys the existing NLP literature to quantify how socio-economic status (SES) has been measured or ignored, and contrasts NLP practice with established SES measurement in the social sciences. It defines SES concepts that NLP researchers can operationalise, reviews objective and subjective metrics (e.g., education, income, occupation, and the MacArthur scale), and offers concrete recommendations for collecting and reporting SES data. By highlighting the lack of representation and the potential for bias, the authors argue that incorporating social class is essential for fairer, more inclusive NLP systems and datasets. The work also outlines future research directions, including the development of datasets and tools to better detect and model socio-economic variation in language data.
Abstract
Since Labov's (1964) foundational work on the social stratification of language, linguistics has dedicated concerted efforts towards understanding the relationships between socio-demographic factors and language production and perception. Despite the large body of evidence identifying significant relationships between socio-demographic factors and language production, relatively few of these factors have been investigated in the context of Natural Language Processing (NLP) technology. While age and gender are well covered, Labov's initial target, socio-economic class, is largely absent. We survey the existing NLP literature and find that only 20 papers even mention socio-economic status. However, the majority of those papers do not engage with class beyond collecting information on annotator demographics. Given this research lacuna, we provide a definition of class that can be operationalised by NLP researchers, and argue for including socio-economic class in future language technologies.
