My LLM might Mimic AAE -- But When Should it?

Sandra C. Sandoval, Christabel Acquaye, Kwesi Cobbina, Mohammad Nayeem Teli, Hal Daumé

TL;DR

The paper addresses how Black Americans view and want African American English (AAE) represented in AI, and whether large language models can authentically generate AAE when prompted. It combines a survey of 104 participants with an annotation task involving 228 annotators to compare LLM outputs from three prominent models against human AAE baselines from the CORAAL and Twitter corpora. The results show nuanced preferences for dialect use: formal tasks favor Mainstream U.S. English, while casual contexts allow AAE when users choose it. Annotators also found that LLM-generated AAE can be as authentic as human speech while remaining non-offensive. These insights support broader, ethically guided inclusion of dialect diversity in AI, while underscoring the need for safeguards against offensive or mocking outputs.

Abstract

We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Through both a survey of Black Americans ($n=104$) and annotation of LLM-produced AAE by Black Americans ($n=228$), we find that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to communicating in Mainstream U.S. English in formal settings, with greater interest in AAE production in less formal settings. When LLMs were appropriately prompted and provided in-context examples, our participants found their outputs to have a level of AAE authenticity on par with transcripts of Black American speech. Select code and data for our project can be found here: https://github.com/smelliecat/AAEMime.git

Paper Structure

This paper contains 82 sections, 28 figures, 8 tables.

Figures (28)

  • Figure 1: This heatmap depicts participant ($n=104$) preferences (horizontal axis) for the use of language varieties in seven scenarios (vertical axis). More participants preferred either that the system use Mainstream U.S. English (MUSE) or that it allow them to select between MUSE and AAE. There were some exceptions: e.g., auto-detection was considered more acceptable in SMS, and MUSE was preferred for email.
  • Figure 2: Examples of response continuations generated by Mixtral, Llama, and GPT, with annotation scores based on human participants’ linguistic judgments.
  • Figure 3: Sample question from the survey on participants' preferences in a realistic scenario.
  • Figure 4: Sample question from the annotation task, where participants are asked to consider the highlighted, underlined part of the interviewee's response and mark their level of agreement with the following statements.
  • Figure 5: Left: Bar Plot of Gender Distribution Among Respondents: This graph displays the count of survey participants according to their gender identification, including Female, Male, Non-Binary, Undisclosed, and Other. The largest groups are Female and Male, with significant representation, while Non-Binary and Other categories show fewer participants. The 'Undisclosed' category represents respondents who preferred not to specify their gender. Right: Bar Plot of Respondent Age Distribution: This graph quantifies the distribution of survey respondents across various age groups. The largest groups are those aged 25-34 and 35-44, demonstrating strong participation from these demographics. In contrast, the 55-64 age group has the fewest respondents. The category labeled 'Und' represents those who preferred not to disclose their age.
  • ...and 23 more figures