AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic

Nathaniel R. Robinson, Shahd Abdelmoneim, Kelly Marchisio, Sebastian Ruder

TL;DR

This work introduces AL-QASIDA, a comprehensive framework for evaluating large language models on Dialectal Arabic proficiency across four competencies: fidelity, understanding, quality, and diglossia. By combining NADI and ALDi-based dialect identification into an ADI2 metric, plus multilingual and monolingual evaluation data (cross-lingual prompts, DA prompts, and bitext/monotext corpora), the authors quantify how well LLMs identify, generate, and translate DA across eight varieties. Key findings reveal that LLMs generally understand DA better than they generate it, often defaulting to Modern Standard Arabic and showing limited cross-lingual dialect transfer, with post-training bias contributing to these gaps. The study also shows few-shot prompting can mitigate deficits at low cost, and it provides concrete recommendations (e.g., GPT-4o for cross-lingual Egyptian/Moroccan requests, Llama-3.1 for monolingual generation) as well as ethical considerations and future directions for making DA technologies more equitable.
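The way ADI2 combines the two dialect-identification signals can be illustrated with a minimal sketch. It assumes, following the Figure 4 caption ("correct-variety dialectness scores", with ADI2=0 indicating the wrong Arabic variety), that a response receives its dialectness level when the identified variety matches the requested one and zero otherwise. The `DialectPrediction` container, the function names, and the [0, 1] dialectness range are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DialectPrediction:
    """Hypothetical container for the outputs of two separate classifiers."""
    variety: str        # country-level variety from a NADI-style identifier, e.g. "egy"
    dialectness: float  # ALDi-style level of dialectness, assumed to lie in [0, 1]


def adi2_score(pred: DialectPrediction, target_variety: str) -> float:
    """Return the dialectness level if the variety is correct, otherwise 0."""
    if not 0.0 <= pred.dialectness <= 1.0:
        raise ValueError("dialectness is assumed to lie in [0, 1]")
    return pred.dialectness if pred.variety == target_variety else 0.0


def mean_adi2(preds: Sequence[DialectPrediction], target_variety: str) -> float:
    """Average the per-response scores; an empty batch scores 0 by convention here."""
    scores = [adi2_score(p, target_variety) for p in preds]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a model that answers an Egyptian request in fluent Moroccan or in Modern Standard Arabic scores zero for that response, which is why a reluctance to generate the requested variety lowers the score even when fluency is high.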

Abstract

Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits LLM applications, yet the research community lacks operationalized performance measurements in DA. We present a framework that comprehensively assesses LLMs' DA modeling capabilities across four dimensions: fidelity, understanding, quality, and diglossia. We evaluate nine LLMs in eight DA varieties and provide practical recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, not because their DA fluency is poor, but because they are reluctant to generate DA. Further analysis suggests that current post-training can contribute to bias against DA, that few-shot examples can overcome this deficiency, and that otherwise no measurable features of input text correlate well with LLM DA performance.

Paper Structure

This paper contains 16 sections, 1 equation, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Arabic greater dialectal regions, per Diab and Habash (2007). Stars indicate the eight nations whose DA varieties are represented in this work.
  • Figure 2: A sentence shared across Syrian, Jordanian, and Palestinian varieties may be labeled as Jordanian but predicted as Syrian, resulting in a false NADI error.
  • Figure 3: Llama models and Command series base models are best at maintaining the user's DA variety, as measured by ADI2 score (bars) and macro-score (marks).
  • Figure 4: ADI2 (correct-variety dialectness scores) distributions across LLMs and genres in the crosslingual task (which requests specific DA varieties of the LLM in English). ADI2=0 indicates the wrong Arabic variety.
  • Figure 5: DA$\rightarrow$Eng MT surpasses Eng$\rightarrow$DA. DA$\leftrightarrow$MSA scores are low in the BTEC genre and rarely above the dotted-line zero-translate SpBLEU baseline for FLORES. Bars represent SpBLEU, while marks represent chrF. Scores are between 0 and 1 (i.e., 0.5 corresponds to 50 SpBLEU points). Note that dza is the country code for Algeria.
  • ...and 7 more figures