Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations

Adib Mosharrof, Moghis Fereidouni, A. B. Siddique

TL;DR

This work shows that task-oriented dialog (ToD) systems can be trained effectively on natural dialogs alone, without turn-level annotations, by framing ToD as multi-task instruction fine-tuning conditioned on domain schemas. It introduces schema augmentation to diversify schemas and improve out-of-domain API call accuracy, achieving strong task completion even in unseen domains. Compared with large proprietary LLMs, fine-tuned ZeroToD models deliver superior or competitive API call accuracy while remaining cost-efficient, and human evaluations corroborate the automatic metrics. The results indicate a practical path toward scalable, zero-shot generalizable ToD systems that operate without costly turn-level annotations.

Abstract

Traditional task-oriented dialog (ToD) systems rely heavily on labor-intensive turn-level annotations, such as dialog states and policy labels, for training. This work explores whether large language models (LLMs) can be fine-tuned solely on natural language dialogs to perform ToD tasks, without requiring such annotations. We evaluate their ability to generalize to unseen domains and compare their performance with models trained on fully annotated data. Through extensive experiments with three open-source LLMs of varying sizes and two diverse ToD datasets, we find that models fine-tuned without turn-level annotations generate coherent and contextually appropriate responses. However, their task completion performance, measured by accurate execution of API calls, remains suboptimal, with the best models achieving only around 53% success in unseen domains. To improve task completion, we propose ZeroToD, a framework that incorporates a schema augmentation mechanism to enhance API call accuracy and overall task completion rates, particularly in out-of-domain settings. We also compare ZeroToD with fine-tuning-free alternatives, such as prompting off-the-shelf LLMs, and find that our framework enables smaller, fine-tuned models that outperform large-scale proprietary LLMs in task completion. Additionally, a human study evaluating informativeness, fluency, and task completion confirms our empirical findings. These findings suggest the feasibility of developing cost-effective, scalable, and zero-shot generalizable ToD systems for real-world applications.

Paper Structure

This paper contains 18 sections, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Human evaluation study on SGD and KETOD. Evaluators were asked to rate the dialog samples on a scale of 1 to 5 in three categories.
  • Figure 2: Multi-task instruction fine-tuning template. Items in blue are dynamic elements and those in purple are important aspects of the prompt.