Modeling Orthographic Variation in Occitan's Dialects

Zachary William Hopton; Noëmi Aepli

TL;DR

The paper investigates whether large multilingual models can handle orthographic and lexical variation across Occitan dialects without spelling normalization during preprocessing. The authors fine-tune mBERT on a corpus covering four Occitan dialects, then assess the model's cross-dialect representations and transferability with a parallel dialect lexicon, intrinsic analogy tasks, cross-dialect lexicon induction, and extrinsic PoS tagging and UD parsing on the Tolosa Treebank. Multi-dialect fine-tuning yields limited gains on the analogy task and on downstream tagging and parsing, but it improves cross-dialect lexicon induction, and high surface similarity between dialects modestly strengthens representations. The findings suggest that normalization may be unnecessary for such multilingual models in low-resource settings. They also highlight the importance of dialect coverage and surface similarity for effective cross-dialect transfer, and point to more closely matched pre-training data and model architectures as directions for future improvement.

Abstract

Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
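To make the notion of "surface similarity" concrete, the sketch below scores two word forms by normalized Levenshtein distance. This measure is an assumption chosen for illustration; this page does not state how the paper quantifies surface similarity. The function names `edit_distance` and `surface_similarity` are hypothetical, and the word pairs are the spelling and lexical examples from the Figure 1 caption.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Illustrative score in [0, 1]: 1 minus length-normalized edit distance.

    Assumption: this is NOT confirmed to be the paper's measure.
    """
    # NFC normalization so that accented letters such as "è" count as one character.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Word pairs taken from the Figure 1 caption: one spelling variant, one lexical variant.
pairs = [("bèu", "beu"), ("mança", "senèstra")]
for left, right in pairs:
    print(left, right, round(surface_similarity(left, right), 3))
```

Under this illustrative measure, the spelling variants score as far more similar than the lexical variants, which is the kind of contrast between dialect forms that the abstract refers to.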

Paper Structure (21 sections, 2 equations, 4 figures, 7 tables)

Introduction
Linguistic Context
Related Work
Method
Creating a Dataset
Fine-Tuning mBERT
Experiments
Analogy Representation
Background
Results
Error Analysis
Lengadocian Lexicon Induction
Background
Results
Error Analysis
...and 6 more sections

Figures (4)

Figure 1: Dialect map of Occitan. The four dialects included in this study are highlighted, along with examples of lexical (i.e., "mança" and "senèstra") and spelling (i.e., "bèu" and "beu") variation between the dialects.
Figure 2: Proportion of vocabulary items in each evaluation corpus that did not appear in the fine-tuning dataset. Red: Lengadocian; Blue: Gascon; Green: Lemosin; Purple: Provençau.
Figure 3: Examples of semantic and syntactic analogies from the Lemosin dataset with English translations in italics. INF: infinitive, PP: past participle.
Figure 4: Confusion matrix for PoS taggers when trained on data from all dialects (left) and only Lengadocian (right).
