Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Gianluca Vico, Jindřich Libovický

TL;DR

This work introduces a crowdsourced Piedmontese dataset capturing non-standard orthography by collecting Italian-to-Piedmontese translations with natural spellings. It benchmarks multiple LLMs on tokenization parity, word alignment, topic classification, and machine translation, revealing a tokenization penalty for Piedmontese yet generally competent cross-lingual classification. Translation in the forward direction (Piedmontese to high-resource languages) is more successful than generation into Piedmontese, which remains challenging without standard orthography. The dataset and evaluation code are publicly released, enabling broader study of low-resource, orthography-variant languages and informing NLP model development for endangered languages.

Abstract

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
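As an illustration of what a tokenization parity score measures, the sketch below assumes one common definition: the ratio between the number of tokens a tokenizer produces for a sentence and the number it produces for the parallel sentence in a reference language, averaged over sentence pairs. A value near one means the two languages are tokenized at similar cost. The exact formula used in the paper may differ, and `toy_tokenize` is a hypothetical stand-in for a real LLM tokenizer.

```python
from statistics import mean
from typing import Callable, List, Sequence


def toy_tokenize(text: str, max_piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer (an assumption, not the
    paper's setup): split on whitespace, then cut each word into pieces of at
    most `max_piece_len` characters."""
    pieces: List[str] = []
    for word in text.split():
        for start in range(0, len(word), max_piece_len):
            pieces.append(word[start:start + max_piece_len])
    return pieces


def tokenization_parity(
    tokenize: Callable[[str], List[str]],
    sentences: Sequence[str],
    reference_sentences: Sequence[str],
) -> float:
    """Mean per-sentence ratio of token counts between parallel sentences.

    Values above one mean `sentences` need more tokens than their
    reference-language counterparts.
    """
    if not sentences or len(sentences) != len(reference_sentences):
        raise ValueError("inputs must be non-empty and sentence-aligned")
    ratios = []
    for sentence, reference in zip(sentences, reference_sentences):
        reference_len = len(tokenize(reference))
        if reference_len == 0:
            raise ValueError("reference sentence produced no tokens")
        ratios.append(len(tokenize(sentence)) / reference_len)
    return mean(ratios)
```

In practice, `tokenize` would wrap the tokenizer of the model under test, and the two sequences would hold the Piedmontese sentences and their Italian or English counterparts.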

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, and 30 tables.

Figures (6)

  • Figure 1: Left: the main language used by the annotators (Icelandic is included in Other). Right: self-reported proficiency in Piedmontese. Most annotators use Italian as their main language, and most self-report perfect or fair proficiency in Piedmontese.
  • Figure 2: Age distribution of the annotators. Most annotators are 20-30 years old. Older people are more likely to know Piedmontese, but we did not reach them.
  • Figure 3: Parity scores with respect to English and Italian. Piedmontese has worse parity than the other languages; however, its parity is closer to one when computed with respect to Italian.
  • Figure 4: A sample alignment. The gray background marks the reference alignment, blue circles mark the eflomal alignment, and green crosses mark the SimAlign alignment. The English translation of the sentence is "This seems sensible, because the Earth doesn't feel as if it's moving, does it?"
  • Figure 5: Comparison of the F1 scores of the models on the topic classification task. Confidence intervals are computed by bootstrapping.
  • ...and 1 more figure
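
The bootstrapped confidence intervals mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration that assumes macro-averaged F1 and a percentile bootstrap over test examples; the paper may use a different F1 variant, number of resamples, or interval method. The function names and default parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    scores: List[float] = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_f1_interval(
    gold: Sequence[str],
    pred: Sequence[str],
    n_resamples: int = 1000,  # assumed value, not taken from the paper
    alpha: float = 0.05,      # 95% interval by default
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for macro-F1, resampling examples
    with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With a test set as small as 145 sentences, such intervals tend to be wide, which is why overlapping intervals between models should not be read as a clear ranking.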