Speech Analysis of Language Varieties in Italy

Moreno La Quatra, Alkis Koudounas, Elena Baralis, Sabato Marco Siniscalchi

TL;DR

This paper addresses automatic identification of the region of origin of Italian language varieties from speech, using self-supervised representations. It leverages multilingual pre-trained speech models and supervised contrastive losses to learn discriminative embeddings, and evaluates them on the VIVALDI corpus. The results show that pre-training combined with contrastive fine-tuning, particularly with a multi-similarity objective, yields the strongest region separation, with a macro F1 of around $51\%$. Embedding visualizations and confusion analyses show both robust discrimination and persistent confusion between geographically close regions, which can inform linguists about inter-regional relationships and evolution. The findings advance the methodology for fine-grained language-variety discrimination and support applications in documentation and education.

Abstract

Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy's linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely related linguistic varieties. In this study, we focus on automatically identifying the geographic region of origin of speech samples drawn from Italy's diverse language varieties. We leverage self-supervised learning models to tackle this task and analyze differences and similarities between Italy's regional languages. In doing so, we also seek to uncover new insights into the relationships among these diverse yet closely related varieties, which may help linguists understand their interconnected evolution and regional development over time and space. To improve the discriminative ability of learned representations, we evaluate several supervised contrastive learning objectives, both as pre-training steps and as additional fine-tuning objectives. Experimental evidence shows that pre-trained self-supervised models can effectively identify regions from speech recordings. Additionally, incorporating contrastive objectives during fine-tuning improves classification accuracy and yields embeddings that distinctly separate regional varieties, demonstrating the value of combining self-supervised pre-training and contrastive learning for this task.
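To make the idea of a supervised contrastive objective concrete, the following is a minimal sketch of one common formulation (the supervised contrastive loss in the style of Khosla et al.), written with the Python standard library only. It is not the authors' implementation: the function names, the temperature value, the toy two-dimensional embeddings, and the region labels are all assumptions chosen for illustration. The loss is small when samples sharing a label lie close together in embedding space and large when they are scattered among samples from other labels.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over one batch.

    For each anchor, every other sample with the same label is a positive;
    all other samples in the batch form the contrast set. Anchors without
    any positive are skipped. The temperature default is an assumption.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue
        # Log-sum-exp over the contrast set, shifted by the max for stability.
        max_sim = max(sims.values())
        log_denominator = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Hypothetical toy data: region names and 2-D embeddings are made up.
labels = ["Piemonte", "Piemonte", "Sicilia", "Sicilia"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # same region close
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # same region far apart

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
print(f"clustered: {loss_clustered:.4f}  mixed: {loss_mixed:.4f}")
```

In this toy setup the clustered arrangement gives the lower loss, which is the behavior a contrastive objective rewards. The triplet-margin and multi-similarity losses evaluated in the paper pursue the same goal with different weightings of positive and negative pairs.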

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: t-SNE visualization of the original XLSR-53-ITA model (a) and of the corresponding versions pre-trained with three different contrastive learning objectives: supervised contrastive loss (b), triplet-margin loss (c), and multi-similarity loss (d).
  • Figure 2: Confusion matrix (a) and t-SNE visualization (b) of the XLSR-53-ITA model with multi-task fine-tuning using the multi-similarity contrastive objective.