Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties

Ekaterina Artemova, Verena Blaschke, Barbara Plank

TL;DR

The paper examines how robust cross-lingual task-oriented dialogue (ToD) systems are to colloquial German and to Urban African American Vernacular English (UAAVE). It introduces a rule-based perturbation framework with 18 German and 118 English perturbations that simulate dialectal morphosyntactic variation, and evaluates six transformer-based encoders on four ToD datasets. Intent recognition remains relatively stable under perturbation, whereas slot filling degrades substantially, especially when many perturbations are applied at once; fine-tuning with in-language data improves robustness. These results underscore the need for dialect-aware evaluation and targeted robustness strategies in ToD systems. The authors provide a perturbation toolkit and results to support further research.

Abstract

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.
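
To make the idea of a rule-based perturbation concrete, here is a minimal sketch of what one rule might look like when applied to a slot-annotated sentence. It is not the authors' implementation. The function name, the BIO label scheme, the `contact` slot name, and the decision to tag the inserted article as `O` are all assumptions made for illustration. Only the general phenomenon, placing a definite article before a personal name (compare the `article_name` perturbation mentioned in Figure 1), comes from the source.

```python
from typing import List, Tuple


def insert_article_before_name(
    tokens: List[str],
    labels: List[str],
    name_slot: str,
    article: str,
) -> Tuple[List[str], List[str]]:
    """Hypothetical perturbation rule: put a definite article before a name.

    tokens and labels are parallel lists (BIO scheme assumed).
    The inserted article is labeled "O" here; whether it should instead
    belong to the slot is an annotation choice this sketch does not settle.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    new_tokens: List[str] = []
    new_labels: List[str] = []
    for token, label in zip(tokens, labels):
        if label == f"B-{name_slot}":
            # Insert the article right before the first token of the name slot.
            new_tokens.append(article)
            new_labels.append("O")
        new_tokens.append(token)
        new_labels.append(label)
    return new_tokens, new_labels


# Toy example (sentence and slot name are made up for illustration).
tokens = ["ruf", "Anna", "an"]
labels = ["O", "B-contact", "O"]
perturbed_tokens, perturbed_labels = insert_article_before_name(
    tokens, labels, name_slot="contact", article="die"
)
print(perturbed_tokens)
print(perturbed_labels)
```

The point of the sketch is that a perturbation has to rewrite tokens and slot labels together. If the two lists drift out of alignment, slot F1 on the perturbed test set would measure annotation noise rather than model robustness.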

Paper Structure (32 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: An illustrative example selected from xSID. The top part displays the intact sentence with gold labels, the bottom part shows the prediction for the perturbed sentence. The perturbations tun_imperative, article_name, name_order are applied. There are errors in predicting the intent and one of the two slots.
  • Figure 2: Intent prediction success rates on the perturbed German test set on MASSIVE with respect to most impactful individual perturbations. The grey bars denote the count of perturbed sentences, the colored bars show the success rate. A logarithmic scale is used.
  • Figure 3: The $\Delta$ slot F$_1$ score of the best-performing model, mDeBERTa, by perturbation category on the perturbed German test sets of the four datasets. $\Delta$ denotes the difference in F$_1$ score between performance on intact and perturbed data.
  • Figure 4: The success rates in intent prediction on the perturbed English test sets with respect to individual perturbations. The grey bars represent the perturbation frequency (i.e., the count of altered sentences), while the colored bars indicate the success rate (i.e., the number of misclassified sentences after applying the perturbation). A logarithmic scale is utilized for improved clarity.
  • Figure 5: The success rates in intent prediction on the perturbed German test sets with respect to individual perturbations. The grey bars represent the perturbation frequency (i.e., the count of altered sentences), while the colored bars indicate the success rate (i.e., the number of misclassified sentences after applying the perturbation). A logarithmic scale is utilized for improved clarity.
  • ...and 2 more figures
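
The two quantities plotted in Figures 2, 4, and 5 (how many sentences a perturbation altered, and how many of those were misclassified afterwards) can be tallied as in the sketch below. This is a hedged illustration, not the authors' evaluation code. The record layout, the function name, and the intent labels are assumptions. The sketch counts a sentence as misclassified whenever the prediction on the perturbed sentence differs from the gold intent; whether the paper additionally requires the intact sentence to have been classified correctly is not stated in the captions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# One record per perturbed sentence (hypothetical layout):
# (names of perturbations applied, gold intent, intent predicted on the perturbed sentence)
Record = Tuple[List[str], str, str]


def perturbation_stats(records: Iterable[Record]) -> Dict[str, Tuple[int, int]]:
    """Return {perturbation: (altered_count, misclassified_count)}."""
    altered: Counter = Counter()
    misclassified: Counter = Counter()
    for applied, gold, predicted in records:
        for name in applied:
            altered[name] += 1
            if predicted != gold:
                misclassified[name] += 1
    return {name: (altered[name], misclassified[name]) for name in altered}


# Toy data; perturbation names are taken from Figure 1, intents are made up.
records = [
    (["article_name"], "call", "call"),
    (["article_name", "name_order"], "call", "message"),
    (["tun_imperative"], "alarm", "alarm"),
]
for name, (n_altered, n_wrong) in sorted(perturbation_stats(records).items()):
    print(name, n_altered, n_wrong)
```

When several perturbations apply to the same sentence, this tally credits the misclassification to each of them, so per-perturbation counts can overlap and should not be summed.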