Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study

Xaver Maria Krückl, Verena Blaschke, Barbara Plank

TL;DR

This study addresses zero-shot slot and intent detection (SID) for Bavarian dialects: encoder-only PLMs are fine-tuned on English SID data and evaluated on Bavarian test sets, including a newly released Munich Bavarian dataset. It compares a baseline without auxiliary tasks, multi-task learning (MTL), and intermediate-task training (ITT), using three Bavarian auxiliary tasks: syntactic dependencies/POS (UD), NER (BarNER), and masked language modeling (MLM). Auxiliary tasks mainly improve slot filling, with NER providing the strongest gains, and ITT yields more consistent improvements than MTL. The best model (MLM×NER→SID) gains up to 5.1 percentage points in intent accuracy and 8.4 percentage points in slot F1 on Bavarian data. Results on the individual Bavarian varieties and on additional test varieties (Swiss German, Standard German, English) show similar patterns across setups with some variety-specific differences, and the new Munich dataset enables evaluation of variation within Bavarian. The work offers practical guidance for dialectal SID via auxiliary tasks and ITT, releases the Munich dataset, and provides open-source tooling for cross-dialect NLU research in digital assistants.

Abstract

Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

Paper Structure (60 sections, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Overview of evaluated setups. We fine-tune pre-trained language models (PLMs) on English SID data (grey) and evaluate them on Bavarian (red). We compare multiple setups: a) no auxiliary tasks, b) multi-task learning by jointly training on English SID data and Bavarian auxiliary tasks ("aux"), c) intermediate-task training on Bavarian, then fine-tuning on English SID data.
  • Figure 2: The Upper German dialect groups Bavarian (blue, right) and Alemannic (green, left), based on Wiesinger (1983). The red dots show the xSID datasets included in this study and our new dataset, de-muc.
  • Figure 3: Slot and intent detection results for the different models, in %. The results are averaged over the three Bavarian dialect test sets and three random seeds (standard deviations shown as error bars). Mean scores and standard deviations per individual dialect are in the appendix. The dashed lines denote the scores of the baseline model (no auxiliary tasks). The setups with auxiliary tasks also use mDeBERTa. The three pale entries at the top are worse-performing baseline models with alternative PLMs.
  • Figure 4: Intent (top) and slot (bottom) scores show similar patterns across experimental setups for the test varieties. The scores are averaged across three random seeds (more details are in the appendix). The pale sections to the left show the scores of baseline models with different PLMs. We use lines despite the categorical nature of the x-axis to make the plots easier to compare.
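
The three setups in Figure 1 differ only in how training is ordered, which a small sketch can make concrete. The Python below is illustrative and not taken from the authors' code. It assumes that in labels such as MLM×NER→SID, "×" joins tasks trained jointly and "→" separates sequential stages; this summary does not state that notation explicitly. The names `Stage`, `parse_setup`, `describe`, and the label NER×SID are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; not taken from the authors' code.
AUX_TASKS = {"UD", "NER", "MLM"}  # Bavarian auxiliary tasks named in the summary


@dataclass(frozen=True)
class Stage:
    """One training stage; all tasks in a stage are trained jointly."""
    tasks: tuple


def parse_setup(spec: str) -> list:
    """Turn a label such as 'MLM×NER→SID' into an ordered list of stages.

    Assumption: '→' separates sequential stages and '×' joins tasks that
    are trained together within one stage.
    """
    stages = []
    for part in spec.split("→"):
        tasks = tuple(t.strip() for t in part.split("×"))
        unknown = [t for t in tasks if t != "SID" and t not in AUX_TASKS]
        if unknown:
            raise ValueError(f"unknown task(s): {unknown}")
        stages.append(Stage(tasks))
    if "SID" not in stages[-1].tasks:
        raise ValueError("the final stage must include the target task SID")
    return stages


def describe(spec: str) -> str:
    """Name the kind of setup (a, b, or c in Figure 1) a label corresponds to."""
    stages = parse_setup(spec)
    if len(stages) == 1:
        return "baseline" if stages[0].tasks == ("SID",) else "multi-task learning"
    return "intermediate-task training"


if __name__ == "__main__":
    for label in ("SID", "NER×SID", "MLM×NER→SID"):
        print(label, "->", describe(label))
```

Under these assumptions, a single stage containing only SID is the baseline, a single stage mixing SID with auxiliary tasks is multi-task learning, and any label with more than one stage is intermediate-task training.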