Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback
Moghis Fereidouni, Md Sajid Ahmed, Adib Mosharrof, A. B. Siddique
TL;DR
RealTOD targets unreliable multi-turn task completion in LLM-powered TOD systems with two components: prompt chaining, which enables zero-shot generalization across domains, and a fine-grained API feedback loop, which checks that API calls adhere to the domain schema. The two-stage prompt chain (example dialog generation, then task adaptation) combined with schema-aware API verification significantly improves Full API Call Accuracy and its sub-metrics on the SGD and BiTOD benchmarks, as shown with four LLMs and a trained user simulator. Human evaluations corroborate the automated metrics, showing gains in task completion, fluency, and informativeness. Ablations confirm that both components help and that prompt chaining yields notable gains. The work has practical implications for scalable TOD deployment, though it also reveals persistent challenges in multi-domain coverage and long-horizon planning that warrant further research.
Abstract
Task-oriented dialog (TOD) systems help users accomplish complex, multi-turn tasks through natural language. While instruction-tuned large language models (LLMs) have demonstrated strong performance on a range of single-turn NLP tasks, they often struggle with reliable multi-turn task completion in TOD settings, particularly when generating the API calls required to interact with external systems. To address this, we introduce RealTOD, a novel framework that improves LLM-based TOD systems through (1) prompt chaining and (2) fine-grained feedback. Prompt chaining enables zero-shot generalization to new domains by automatically synthesizing a schema-aligned in-context example for the target task. Fine-grained feedback verifies each generated API call against the domain schema, identifies specific errors, and provides targeted correction prompts. To evaluate task completion reliability, we introduce Full API Call Accuracy as a robust metric, along with detailed sub-metrics that capture common failure modes. We conduct extensive experiments on the SGD and BiTOD benchmarks using four LLMs. RealTOD improves Full API Call Accuracy, surpassing the state-of-the-art AutoTOD by 37.10% on SGD and the supervised baseline SimpleTOD by 10.32% on BiTOD. Human evaluations further confirm that LLMs integrated with RealTOD achieve superior task completion, fluency, and informativeness compared to existing methods.
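
To make the fine-grained feedback idea concrete, the following is a minimal Python sketch of checking one generated API call against a domain schema and turning the specific errors into a correction prompt. It is an illustration under stated assumptions, not the RealTOD implementation: the schema layout, the names `SCHEMA`, `verify_api_call`, and `build_feedback`, the `FindRestaurants` method, and the wording of the feedback message are all hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hypothetical schema for illustration only; not taken from SGD or BiTOD.
SCHEMA = {
    "FindRestaurants": {
        "required": {"city", "cuisine"},
        "optional": {"price_range"},
    },
}


def verify_api_call(method: str, params: dict[str, Any], schema: dict) -> list[str]:
    """Return specific error messages; an empty list means the call is valid."""
    if method not in schema:
        return [f"Unknown method '{method}'. Valid methods: {sorted(schema)}."]

    spec = schema[method]
    provided = set(params)
    allowed = spec["required"] | spec["optional"]

    errors = []
    # Required parameters that the model left out.
    for name in sorted(spec["required"] - provided):
        errors.append(f"Missing required parameter '{name}'.")
    # Parameters that the schema does not define for this method.
    for name in sorted(provided - allowed):
        errors.append(f"Unexpected parameter '{name}'.")
    return errors


def build_feedback(errors: list[str]) -> str:
    """Format the errors as a targeted correction prompt (assumed wording)."""
    lines = ["Your API call did not match the schema:"]
    lines += [f"- {error}" for error in errors]
    lines.append("Please regenerate the API call with these issues fixed.")
    return "\n".join(lines)


if __name__ == "__main__":
    # A call that misnames 'cuisine' as 'food'.
    found = verify_api_call(
        "FindRestaurants", {"city": "Oakland", "food": "thai"}, SCHEMA
    )
    if found:
        print(build_feedback(found))
```

In a full system, a prompt like this would presumably be sent back to the LLM and the regenerated call verified again, with some cap on the number of retries; that loop is omitted here because the abstract does not specify it.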
