Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties
Ekaterina Artemova, Verena Blaschke, Barbara Plank
TL;DR
The paper examines how robust cross-lingual task-oriented dialogue (ToD) systems are to colloquial German and to Urban African American Vernacular English (UAAVE). It introduces a rule-based perturbation framework with 18 German and 118 English perturbations that simulate dialectal morphosyntactic variation, and evaluates six transformer-based encoders across four ToD datasets. Intent recognition remains relatively stable under perturbation, whereas slot filling degrades substantially, especially when many perturbations are applied at once; fine-tuning on in-language data improves robustness. These results underscore the need for dialect-aware evaluation and targeted robustness strategies in ToD systems. The authors provide a perturbation toolkit and results to support further research.
Abstract
Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.
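The abstract reports each performance drop twice: as a relative change (for example 6%) and as an absolute change in percentage points (for example 4.62). The following is a minimal sketch of how the two quantities relate. The function names and the example scores are hypothetical illustrations and are not taken from the paper.

```python
def absolute_drop_pp(baseline: float, perturbed: float) -> float:
    """Absolute drop in percentage points (scores given on a 0-100 scale)."""
    return baseline - perturbed


def relative_drop(baseline: float, perturbed: float) -> float:
    """Drop expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (baseline - perturbed) / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline_f1 = 80.0   # slot F1 on the unperturbed test set
perturbed_f1 = 60.0  # slot F1 on the perturbed test set

pp = absolute_drop_pp(baseline_f1, perturbed_f1)
rel = relative_drop(baseline_f1, perturbed_f1)
print(f"{pp:.2f} percentage points, {rel:.0%} relative")
```

Because the relative drop depends on the baseline, the same absolute drop is a larger relative loss for a weaker model, which is one reason to report both figures.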
