From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

Abdulmuizz Khalak, Abderrahmane Issam, Gerasimos Spanakis

TL;DR

The paper investigates cross-dialect transfer in Arabic language models by combining probing on sentiment analysis, named entity recognition, and part-of-speech tagging with Representational Similarity Analysis (RSA) using Centered Kernel Alignment (CKA) on the MADAR parallel corpus. It demonstrates that transfer from Modern Standard Arabic (MSA)-centric models to dialects is possible but uneven, influenced by geographic proximity and pretraining data size, and it reveals instances of negative interference in multi-dialect pretraining. The study provides a dual perspective by measuring functional transfer via probing and intrinsic similarity via RSA, linking both to dialectal geography and corpus scale. These findings inform the design of dialect-aware Arabic LMs and highlight the need for targeted data collection and potentially dialect-specific parameters to optimize cross-dialect transfer.

Abstract

Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA, as the standard written variety, is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since the dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on three Natural Language Processing (NLP) tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This calls into question the dialects' degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.

Paper Structure (25 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Architecture of the probing classifier for the example sentence “The boy is eating the apple now.” Sentence representations pass through N layers, and each layer is probed using the paper's probing classifier equation.
  • Figure 2: Architecture of CKA for representation similarity. MADAR parallel sentences are encoded by MSA and DA encoders through N layers, and the resulting representations are compared using the paper's linear CKA equation.
  • Figure 3: Performance of best performing layer on MSA Tasks.
  • Figure 4: Impact of pretraining corpus size on probe performance across tasks. The Percentile Rank of the number of tokens is displayed for better visual interpretation.
  • Figure 5: Relative performance of general vs. dialect-specific Arabic models on native dialectal datasets. Points to the right of the reference line denote cases where the dialect-specific model achieves higher performance, while points to the left indicate that the general model remains superior.
  • ...and 2 more figures
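
The caption of Figure 2 refers to linear CKA, but the equation itself is not reproduced in this excerpt. As a rough illustration, the sketch below implements the commonly used linear CKA formulation, in which both representation matrices are column-centered and the score is the squared Frobenius norm of their cross-covariance, normalized by the Frobenius norms of the two self-covariances. The function name `linear_cka`, the array shapes, and the random toy data are assumptions made for illustration. They stand in for per-layer sentence representations of parallel sentences and are not the authors' implementation or data.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two matrices of shape (n_sentences, hidden_dim).

    Rows must be aligned: row i of x and row i of y encode the same
    parallel sentence (hypothetically, its MSA and dialect versions).
    """
    # Center every feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, then normalize.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy data (assumption): 200 "sentences" with 32-dimensional representations.
rng = np.random.default_rng(0)
msa_layer = rng.normal(size=(200, 32))

# A rotated copy of the same representations: linear CKA is invariant
# to orthogonal transformations, so the score should be close to 1.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
da_layer = msa_layer @ rotation

# Independent random representations should score much lower.
unrelated = rng.normal(size=(200, 32))

score_same = linear_cka(msa_layer, da_layer)
score_unrelated = linear_cka(msa_layer, unrelated)
print(f"rotated copy: {score_same:.3f}, unrelated: {score_unrelated:.3f}")
```

In a setup like the one Figure 2 describes, `x` and `y` would presumably hold the representations that one layer produces for the MSA and dialectal sides of the same parallel sentences, with the score computed once per layer.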