Evaluating Dialect Robustness of Language Models via Conversation Understanding

Dipankar Srirag; Nihar Ranjan Sahoo; Aditya Joshi

TL;DR

This work introduces M-MD3, a framework for assessing dialect robustness in conversation understanding. It extends the MD3 taboo-dialogue dataset to four subsets (en-US, en-IN, en-MV, en-TR) and two tasks (TWP and TWS). It evaluates one open-source LLM (Llama-3) and two closed-source LLMs (GPT-4, GPT-3.5) in pre-trained and dialect-fine-tuned regimes. The results reveal a consistent US-English advantage and show that synthetic transformations can either harm or help model understanding, depending on the model and task. An error analysis identifies categories such as Ambiguous Descriptions, Shared Cultural Context, and Public Figures, highlighting where models struggle with dialectal cues and cultural context despite advances in prompting and fine-tuning. The study provides a novel methodology for dialect-focused dialogue evaluation and demonstrates how dialectal data and transformations can be used to probe and improve model generalization across English varieties, with practical implications for deploying inclusive language technologies. The results underscore the importance of incorporating dialectal diversity in pre-training and fine-tuning to mitigate systematic biases against non-US varieties.

Abstract

With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably across different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English-language (US English or Indian English) conversations between humans who play the word-guessing game of 'taboo'. We formulate two evaluative tasks: target word prediction (TWP) (i.e., predict the masked target word in a conversation) and target word selection (TWS) (i.e., select the most likely masked target word in a conversation from a set of candidate words). Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with the en-US and en-IN subsets. We create two additional subsets: en-MV (where en-US is transformed to include dialectal information) and en-TR (where dialectal information is removed from en-IN). We evaluate one open-source (Llama3) and two closed-source (GPT-4/3.5) LLMs. LLMs perform significantly better for US English than for Indian English on both the TWP and TWS tasks, in all settings, exhibiting marginalisation against the Indian dialect of English. While GPT-based models perform the best, the comparatively smaller models work more equitably after fine-tuning. Our error analysis shows that the LLMs can understand the dialect better after fine-tuning on dialectal data. Our evaluation methodology offers a novel way to examine attributes of language models using pre-existing dialogue datasets.
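The two tasks can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mask token `<MASK>`, the helper names (`mask_target`, `twp_correct`, `tws_correct`, `accuracy`), the example conversation, and the use of case-insensitive exact match as the scoring rule are all assumptions made for illustration.

```python
import re
from typing import List, Sequence

MASK = "<MASK>"  # hypothetical mask token, not taken from the paper


def mask_target(turns: Sequence[str], target: str, mask: str = MASK) -> List[str]:
    """Replace whole-word, case-insensitive occurrences of the target word."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    return [pattern.sub(mask, turn) for turn in turns]


def twp_correct(prediction: str, target: str) -> bool:
    """TWP: the model outputs a free-form word; scored here by exact match (assumed)."""
    return prediction.strip().lower() == target.strip().lower()


def tws_correct(choice_index: int, candidates: Sequence[str], target: str) -> bool:
    """TWS: the model picks one index from a candidate list."""
    return candidates[choice_index].strip().lower() == target.strip().lower()


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of correct items; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# Invented example conversation between a describer and a guesser.
conversation = [
    "Describer: it barks and wags its tail",
    "Guesser: is it a dog?",
    "Describer: yes, Dog!",
]
masked = mask_target(conversation, "dog")
```

Comparing `accuracy` over en-US items with `accuracy` over en-IN items would then give one simple view of the performance gap between the two subsets.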

Paper Structure (31 sections, 3 figures, 8 tables)

This paper contains 31 sections, 3 figures, 8 tables.

Introduction
Methodology
en-MV
en-TR
Extending MD3
Analysis
Task Definition
Experiment Setup
Model Parameters
Metrics
Experiments
Results
Quantitative Results
en-US versus en-IN
Impact of transforming conversations
...and 16 more sections

Figures (3)

Figure 1: Illustration of the two tasks: target word prediction (TWP) and target word selection (TWS). The two speakers are the describer and the guesser in a word-guessing game of taboo, and the conversations shown are in Indian English and US English.
Figure 2: Steps for evaluation of dialect robustness.
Figure 3: M-MD3 as an extension of MD3: (a) Creation of en-MV and en-TR, and (b) Creation of target-word-masked conversations.
