Modeling Orthographic Variation in Occitan's Dialects
Zachary William Hopton, Noëmi Aepli
TL;DR
The paper investigates whether large multilingual models can cope with orthographic and lexical variation across Occitan dialects without spelling normalization during preprocessing. The authors fine-tune mBERT on a four-dialect Occitan corpus and assess its cross-dialect representations and transferability using a parallel dialect lexicon, intrinsic analogy tasks, cross-dialect lexicon induction, and extrinsic PoS tagging and UD parsing on the Tolosa Treebank. Multi-dialect fine-tuning yields limited gains on the analogy tasks and on downstream tagging and parsing, but it improves cross-dialect lexicon induction, and high surface similarity between dialects modestly strengthens representations. The findings suggest that normalization may be unnecessary for such multilingual models in low-resource settings. They also highlight the importance of dialect coverage and surface similarity for effective cross-dialect transfer, and point to more closely matched pre-training data and model architectures as directions for future improvement.
Abstract
Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
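To make the notion of surface similarity between dialect word forms concrete, the following is a minimal sketch that scores two spellings by normalized Levenshtein distance. The abstract does not say which similarity measure the authors used, so the metric, the function names `levenshtein` and `surface_similarity`, and the example word pair are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two spellings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Illustrative spelling variants only; not taken from the paper's lexicon.
    pair = ("nuèch", "nueit")
    print(pair, round(surface_similarity(*pair), 2))
```

Averaging such a score over all entries of a parallel lexicon would give one rough per-dialect-pair similarity figure, which could then be compared against how well the model aligns the same dialect pair.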
