Demographic Dialectal Variation in Social Media: A Case Study of African-American English

Su Lin Blodgett, Lisa Green, Brendan O'Connor

TL;DR

The paper tackles dialectal variation in online text by focusing on African-American English (AAE) in Twitter data. It introduces a distantly supervised framework that links geo-demographic census data to language use through direct word-demographic analysis and a mixed-membership demographic-language model, yielding African-American-aligned (AA-aligned) and white-aligned corpora. Linguistic validation shows that the AA-aligned text exhibits known AAE phonology and syntax and uncovers novel orthographic patterns. The study also finds that existing language identification and dependency parsing tools perform worse on AAE-like text than on white-aligned text, which motivates an ensemble language identifier that improves English detection. The work provides a sizable AAE-aligned corpus and demonstrates a practical path toward dialect-aware NLP, with potential extensions to other underrepresented dialects through demographic signals and unsupervised modeling.

Abstract

Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.

Paper Structure

This paper contains 22 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Mixed-membership model for users ($u$), messages ($m$) and tokens ($t$). Observed variables have a double-lined border.
  • Figure 2: Proportion of tweets containing AAE syntactic constructions by messages' posterior probability of AA. On the x-axis, 0.1 refers to the decile [0, 0.1).
  • Figure 3: Proportion of tweets classified as non-English by messages' posterior probability of AA. On the x-axis, 0.1 refers to the decile [0, 0.1).
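
The decile binning described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It assumes each message comes with a posterior probability of AA in [0, 1] and a boolean flag (for example, "classified as non-English"). The function name `proportion_by_decile`, its inputs, and the choice to fold a posterior of exactly 1.0 into the last bin are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def proportion_by_decile(posteriors, flags):
    """Return {bin label: share of flagged messages} for each non-empty decile.

    Bins are labeled by their upper edge, so 0.1 stands for [0, 0.1),
    matching the x-axis convention in the figure captions.
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for p, flag in zip(posteriors, flags):
        # Bin index 0..9; a posterior of exactly 1.0 goes into the last bin
        # (an assumption, since the captions do not say how 1.0 is handled).
        idx = min(int(p * 10), 9)
        label = round((idx + 1) / 10, 1)
        counts[label] += 1
        hits[label] += int(flag)
    return {label: hits[label] / counts[label] for label in sorted(counts)}


# Hypothetical toy data, not values from the paper.
example = proportion_by_decile(
    [0.05, 0.07, 0.55, 0.95, 1.0],
    [True, False, False, True, True],
)
print(example)
```

Plotting the returned proportions against their bin labels would give a curve of the same kind as the figures, though the actual values depend entirely on the data and the model's posteriors.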