DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin

TL;DR

DialUp presents a principled, resource-light approach to extending MT systems pretrained on a high-resource language (HRL) to the dialect continuum of closely related languages (CRLs) around it. It combines two strategies: M->D, which trains on synthetic dialectal variations to improve robustness to unseen CRLs, and D->M, which at inference time re-aligns CRL input to HRL-like expressions using bilingual lexicons, with a focus on function words. Across 49 CRLs from six language families and two base models, both strategies yield substantial gains, with M->D offering broad, lower-variance improvements and D->M delivering notable benefits for select, low-baseline CRLs; the combination M<->D often performs best. Analyses show gains correlate with baseline performance and HRL-CRL relatedness; function-word handling emerges as particularly impactful for D->M, while content-word swaps frequently degrade performance due to lexical and morphological complexity. The work highlights practical trade-offs, including script diversity, lexicon availability, and the need for family-specific tuning, offering a flexible recipe to make MT systems more dialect-robust and scalable to undocumented varieties.

Abstract

Most of the world's languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectal variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M->D), and an inference-time intervention adapting dialectal data to the model expertise (D->M). M->D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectal variation, whereas D->M treats dialectal divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.
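
As a rough illustration of the D->M idea described above (rewriting dialectal input toward HRL-like input before it reaches the MT model), the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization and a small bilingual lexicon restricted to function words. The function name `dialect_to_model` and the lexicon entries are hypothetical, invented purely for the example.

```python
from typing import Dict

# Hypothetical toy lexicon: invented "dialect" function words mapped to
# HRL (here, English) equivalents. These entries are placeholders and are
# not taken from the paper or from any real lexicon.
TOY_FUNCTION_LEXICON: Dict[str, str] = {
    "da": "the",
    "ov": "of",
}


def dialect_to_model(sentence: str, function_lexicon: Dict[str, str]) -> str:
    """Swap tokens found in the function-word lexicon for their HRL form.

    Tokens not in the lexicon (typically content words) are left untouched.
    Lookup is case-insensitive; a swapped token takes the lexicon's casing.
    """
    swapped = []
    for token in sentence.split():
        swapped.append(function_lexicon.get(token.lower(), token))
    return " ".join(swapped)


if __name__ == "__main__":
    # Only the function words are rewritten; "house" and "stone" pass through.
    print(dialect_to_model("da house ov stone", TOY_FUNCTION_LEXICON))
```

In this sketch the rewritten sentence would then be passed to the unchanged pretrained MT model. Restricting the lexicon to function words mirrors the TL;DR's observation that function-word handling matters most for D->M, while content-word swaps often hurt.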

Paper Structure

This paper contains 58 sections, 8 figures, 43 tables.

Figures (8)

  • Figure 1: Two paradigms for robustness to dialects on a continuum of distances from an HRL. DialUp involves M->D: training the model on artificial dialectal variation, and D->M: bringing dialectal data closer to model expectations (HRL-like input) at inference.
  • Figure 2: BLEU point improvement of the best DialUp method (M->D, D->M, or M<->D) over the best baseline (off-the-shelf, fthrl, or randaug). Languages are ordered by their M2M off-the-shelf performance.
  • Figure 3: BLEU score improvements over the best baseline with M2M for three language families. $\uparrow$ and $\downarrow$: # CRLs with positive/negative gains. M->D gives more consistent positive gains.
  • Figure 4: Decision trees indicate that the languages benefiting most from adaptation are low-baseline languages with less than 1.75 times HRL token fertility for Aya, and low-baseline languages with more than 24.2 chrF proximity to the HRL.
  • Figure 5: Gains in BLEU points for different values of $\theta^p$ (1-dimensional noiser) for Indic languages, with dotted lines showing the performance of M->D-cloud, using the 3-dimensional noiser $\theta^{p,m,f}$ with default parameters. Tuning only $\theta^p$ for Indic is competitive with M->D-cloud.
  • ...and 3 more figures