Classist Tools: Social Class Correlates with Performance in NLP

Amanda Cercas Curry, Giuseppe Attanasio, Zeerak Talat, Dirk Hovy

TL;DR

This work shows that socioeconomic status (SES) shapes both language use and NLP performance, using a dataset of 95K utterances of scripted TV and film dialogue annotated for SES, race, and geography. Combining sociolinguistic theory with empirical NLP experiments, the authors analyze lexical cues, automatic speech recognition (ASR) accuracy, language-model perplexity, and grammar-correction behavior across SES groups, and find that current NLP systems are consistently biased against lower-SES speech and non-white varieties. They argue for SES reporting and SES-aware design, highlight the practical impact of SES on accessibility and fairness in NLP, and provide a dataset and methodological framework to advance equitable language technologies.

Abstract

Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.

Paper Structure (23 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Word error rate for the different models, grouped by the speakers' SES and race, for US-based TV shows. The race in the chart refers to that of the majority of characters in the show.
  • Figure 2: Mean perplexity among UK-based shows by model.
  • Figure 3: Proportion of utterances that are grammar-corrected per model and geographical language variety.
  • Figure 4: Distribution by class of utterances with at least one correction.