Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

Verena Blaschke, Miriam Winkler, Barbara Plank

TL;DR

This work examines transfer from Standard German to German dialects by contrasting text-only, speech-only, and cascaded (speech transcription followed by a text model) setups on intent and topic classification. It introduces a dialectal audio intent classification dataset (German and Bavarian) and evaluates a broad set of encoders, ASR models, and dataset conditions to reveal modality-dependent patterns. Speech-only systems perform best on dialect data, while text-only systems perform best on Standard German. Cascaded systems can outperform text-only ones when the ASR output is normalized toward Standard German, though results vary by dialect. The study highlights the importance of speech data and ASR normalization for dialect NLP and offers practical guidance for dataset creation and deployment in dialect-rich, low-resource contexts.

Abstract

Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.

Paper Structure

This paper contains 54 sections, 1 figure, 11 tables.

Figures (1)

  • Figure 1: We compare three evaluation setups for German and dialectal text and speech data.