Modeling Orthographic Variation in Occitan's Dialects

Zachary William Hopton, Noëmi Aepli

TL;DR

The paper investigates whether large multilingual models can cope with orthographic and lexical variation across Occitan dialects without spelling normalization during preprocessing. The authors fine-tune mBERT on a four-dialect Occitan corpus and assess its cross-dialect representations and transferability in two ways: intrinsically, with analogy tasks and cross-dialect lexicon induction based on a parallel dialect lexicon, and extrinsically, with PoS tagging and UD parsing on the Tolosa Treebank. Multi-dialect fine-tuning yields limited gains on the analogy tasks and on downstream tagging and parsing, but it improves cross-dialect lexicon induction, and high surface similarity between dialects modestly strengthens representations. The findings suggest that normalization may be unnecessary for such multilingual models in low-resource settings. They also highlight the importance of dialect coverage and surface similarity for effective cross-dialect transfer, and point to more closely matched pre-training data and model architectures as directions for future improvement.

Abstract

Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
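
To make the notion of "surface similarity" concrete, the following is a minimal sketch that scores two word forms by normalized Levenshtein distance. This metric is an assumption chosen for illustration and is not confirmed here as the measure the authors used. The function names `levenshtein` and `surface_similarity` are hypothetical, and the word pairs are the spelling and lexical examples from the Figure 1 caption.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Hypothetical similarity score in [0, 1]; 1.0 means identical strings.

    Assumption: normalized edit distance stands in for "surface similarity".
    """
    # Compose accented characters so that "è" counts as a single symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Spelling variation versus lexical variation (pairs from Figure 1).
spelling = surface_similarity("bèu", "beu")
lexical = surface_similarity("mança", "senèstra")
print(f"spelling pair: {spelling:.2f}, lexical pair: {lexical:.2f}")
```

Under this proxy, a spelling variant differs from its counterpart by a single character and scores high, while a lexical variant shares almost no characters and scores low. Averaging such scores over a parallel lexicon would give one rough estimate of how close two dialects are on the surface.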

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Dialect map of Occitan. The four dialects included in this study are highlighted, along with examples of lexical (i.e., "mança" and "senèstra") and spelling (i.e., "bèu" and "beu") variation between the dialects.
  • Figure 2: Proportion of vocabulary items in each evaluation corpus that did not appear in the fine-tuning dataset. Red: Lengadocian; Blue: Gascon; Green: Lemosin; Purple: Provençau.
  • Figure 3: Examples of semantic and syntactic analogies from the Lemosin dataset with English translations in italics. INF: infinitive, PP: past participle.
  • Figure 4: Confusion matrix for PoS taggers when trained on data from all dialects (left) and only Lengadocian (right).