Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models

Dipankar Srirag, Aditya Joshi, Jacob Eisenstein

TL;DR

This work addresses dialect robustness in decoder models for target word prediction (TWP) on masked, game-like dialogues from MD-3, comparing en-US with en-IN and en-NG. It introduces LoRDD, a dual low-rank adapter architecture consisting of a task adapter optimized with $\mathcal{L}_{\texttt{Task}}$ and a dialect adapter optimized with a contrastive loss $\mathcal{L}_{\texttt{Dial}}$, trained on a mix of augmented US-English data and a pseudo-parallel corpus of natural conversations. Using two open-weight decoders (Mistral and Gemma), LoRDD outperforms in-dialect and cross-dialect baselines and substantially narrows the gap to the en-US skyline (e.g., en-IN similarity/accuracy gaps reduced to $12\%$/$25\%$ and en-NG gaps to $5.8\%$/$4.5\%$). The ablation studies confirm the necessity of both adapters and natural parallel data, underscoring the potential of dialect adaptation for decoder models and guiding future work toward broader causal language modeling tasks and dialects.

Abstract

Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties ('dialects' for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters, where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our experiments on Indian English and Nigerian English conversations with two models (Mistral and Gemma) demonstrate that LoRDD outperforms four baselines on TWP. Additionally, it significantly reduces the performance gap with American English, narrowing it to 12% (Indian English) and 5.8% (Nigerian English) for word similarity, and to 25% and 4.5% for accuracy, respectively. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models using TWP, a simplified version of the commonly used next-word prediction task.
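
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a low-rank adapter update and of a contrastive loss on conversation embeddings. It rests on assumptions: the adapter uses the standard low-rank parameterization (a frozen weight plus a product of two small matrices), and the contrastive loss is a cosine-based pairwise form. The text above does not give the exact form of the dialect loss, and the function names `lora_forward` and `contrastive_loss` are hypothetical, not taken from the paper.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Frozen weight W (d_out x d_in) plus a low-rank update B @ A.

    A has shape (r x d_in) and B has shape (d_out x r), with r small.
    Only A and B would be trained; W stays fixed.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(u, v, is_parallel, margin=0.0):
    """Assumed cosine-based contrastive loss on a pair of embeddings.

    Pseudo-parallel pairs are pulled together (loss = 1 - cos);
    non-parallel pairs are pushed apart until cos <= margin.
    """
    sim = cosine(u, v)
    return 1.0 - sim if is_parallel else max(0.0, sim - margin)
```

In this sketch, a dialect adapter would be trained so that embeddings of pseudo-parallel conversations across dialects receive a low `contrastive_loss`, while a separate task adapter of the same low-rank shape would be trained on the TWP objective.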

Paper Structure

This paper contains 13 sections, 2 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Illustrative example of Target Word Prediction on an en-IN conversation. The inaccurate output from the in-dialect fine-tuned model (left) is corrected by the model trained using LoRDD (right).
  • Figure 2: Architecture of LoRDD.
  • Figure 3: Percentage count of dialect features in erroneous instances from LoRDD.