Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests

Amogh Mannekote; Jinseok Nam; Ziming Li; Jian Gao; Kristy Elizabeth Boyer; Bonnie J. Dorr

TL;DR

The paper tackles the challenge of indirect user requests (IURs) in task-oriented dialogue, which require pragmatic reasoning and world knowledge often lacking in smaller, deployed models. It introduces a five-stage, schema-guided data generation pipeline that uses LLMs to generate IURs, defines three linguistic criteria (Appropriateness, Unambiguity, World-Understanding), and crowdsources human labels for the generated utterances. The authors release IndirectRequests, a dataset based on SGD schemas, as a testbed for evaluating how NLU and DST models handle indirectness, and they show that even strong DST systems degrade on these utterances. They also propose automated proxy evaluators, spanning small LMs, proprietary LLMs, and open-source LLMs, to approximate human judgments. Overall, the work provides a reusable methodology for generating realistic indirect requests and a challenging benchmark for improving the robustness of small models in real-world dialogue systems.

Abstract

Indirect User Requests (IURs), such as "It's cold in here" instead of "Could you please increase the temperature?", are common in human-human task-oriented dialogue and require world knowledge and pragmatic reasoning from the listener. While large language models (LLMs) can handle these requests effectively, smaller models deployed on virtual assistants often struggle due to resource constraints. Moreover, existing task-oriented dialogue benchmarks lack sufficient examples of complex discourse phenomena such as indirectness. To address this, we propose a set of linguistic criteria along with an LLM-based pipeline for generating realistic IURs to test natural language understanding (NLU) and dialogue state tracking (DST) models before deployment in a new domain. We also release IndirectRequests, a dataset of IURs based on the Schema-Guided Dialogue (SGD) corpus, as a comparative testbed for evaluating the performance of smaller models in handling indirect requests.

Paper Structure (36 sections, 5 figures, 5 tables)

Introduction
Schema-Guided Dialogue
Linguistic Criteria
Appropriateness.
Unambiguity.
World-Understanding.
The IndirectRequests Dataset
Generating the Seed Dataset
Crowdsourcing Human Labels
Unambiguity Annotation.
World-Understanding Annotation.
Dataset Splits
Proxy Evaluation of Linguistic Criteria
Unambiguity.
World-Understanding.
...and 21 more sections

Figures (5)

Figure 1: Two settings are illustrated for indirect user requests (IURs): restaurant-reservation and home-automation.
Figure 2: The five-stage generation pipeline.
Figure 3: We illustrate a dialogue schema in the music service domain, with an intent to play music and a slot for selecting a playback device (e.g., TV, kitchen speaker, bedroom speaker). Our approach generates an indirect utterance based on a specified slot value, such as 'TV.'
Figure 4: The M-Turk crowdsourcing interface for collecting human annotations over the seed dataset contains two form elements. The first assesses the Unambiguity of the generated utterance, ensuring that it entails only the target slot value. The second assesses the World-Understanding criterion, using a slider to rate the likelihood that an average six-year-old could correctly infer the target slot value. The latter is an intuitive proxy for the complexity of world understanding required to interpret the utterance.
Figure 5: We report the quality of the IURs generated using smaller, open-source Llama 2 models of three different sizes (7B, 13B, 70B). All evaluation results are obtained using the best-performing GPT-4 proxy evaluation model (as described in the section on proxy evaluators).
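
The schema setup described in the Figure 3 caption (an intent to play music, a slot for the playback device, and an indirect utterance tied to one target slot value) can be pictured as a small data structure. The following is only a minimal sketch, not the authors' code or the SGD format: the class names, the identifiers `PlayMusic` and `playback_device`, the helper `make_request`, and the example utterance are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Slot:
    """A schema slot with a closed set of candidate values."""
    name: str
    possible_values: list[str]


@dataclass
class Intent:
    """A schema intent together with the slots it can fill."""
    name: str
    slots: list[Slot] = field(default_factory=list)


@dataclass
class IndirectRequest:
    """An utterance that should imply exactly one target slot value."""
    intent: str
    slot: str
    target_value: str
    utterance: str


def make_request(intent: Intent, slot_name: str,
                 target_value: str, utterance: str) -> IndirectRequest:
    """Build a record, rejecting slots or values the schema does not define."""
    slot = next((s for s in intent.slots if s.name == slot_name), None)
    if slot is None:
        raise ValueError(f"unknown slot: {slot_name}")
    if target_value not in slot.possible_values:
        raise ValueError(f"value not in schema: {target_value}")
    return IndirectRequest(intent.name, slot_name, target_value, utterance)


# Hypothetical schema mirroring the Figure 3 description.
play_music = Intent(
    name="PlayMusic",
    slots=[Slot("playback_device",
                ["TV", "kitchen speaker", "bedroom speaker"])],
)

# Invented utterance for illustration; it is not taken from the dataset.
example = make_request(
    play_music, "playback_device", "TV",
    "I'm on the couch in front of the big screen, play it there.",
)
print(example)
```

Keeping the target value next to the utterance is what makes the Unambiguity and World-Understanding judgments in Figure 4 well defined: annotators rate the utterance against one specific slot value.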
