Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, Yi Zhang

TL;DR

This work tackles the data bottleneck in adapting LLMs to task-oriented dialogues by enabling self-generated training data through a self-talk loop between a client and an agent LLM. It introduces structured prompting and workflow-graph prompting to steer conversations, along with automated metrics for subgoal completion, ending detection, and character consistency, which are validated against human judgments. Finetuning experiments show that carefully filtered self-generated data can improve agent performance, though data quality and diversity trade-offs matter and multi-loop self-talk can be unstable. The study demonstrates the feasibility of bootstrapping task-oriented dialogue agents from their own outputs and provides automated evaluation tools and insights for future self-improvement research in LLMs.

Abstract

Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method to do so, yet requires a number of data samples that a) might not be available or b) might be costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back into the LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.

Paper Structure (54 sections, 1 equation, 19 figures, 2 tables)

Figures (19)

  • Figure 1: Schematic representation of our approach. Two LLMs, called a client and an agent, are prompted to converse with each other in different roles, with the agent asked to follow a specific narrative structure. Generated conversations will then be filtered by quality and used for supervised finetuning on the agent model until it adapts to the intended dialogue structure.
  • Figure 2: Illustration of the structured prompting: Workflows are parsed into a directed graph (left). At every turn of the conversation, we ask an LLM to compare the client's last utterance with the reference responses corresponding to the outgoing edges of the current node. If one of them is chosen, we continue with the next node in the graph and prompt the agent with the corresponding question on the next turn; otherwise, we stay in the same place in the graph and let the model generate freely.
  • Figure 3: Analysis of the relationship between properties of the finetuning dataset and their impact on the absolute completion of the dialogue, given (a) Spearman's $\rho$ correlation values and (b) the coefficients of the linear regression model without a bias and with lasso regularization. Error bars and the regularization weight were determined via cross-validation.
  • Figure 4: Results of the human evaluation study for three baselines and the two best filters from the bootstrapping experiment section along six different questions. Shown are the percentages of ratings per filter, either on a five-point scale or using positive, negative and unsure options. Dashed lines indicate the numerical average and $\bigstar$ signifies statistical significance compared to all other options assessed via the ASO test (del Barrio et al., 2018; Dror et al., 2019; Ulmer et al., 2022) with $\tau = 0.5$ and a confidence level of $\alpha = 0.9$.
  • Figure 5: Conversation generated after finetuning with the %-Subgoals (0.05) filter, with the agent ignoring the given workflow.
  • ...and 14 more figures
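
The graph traversal described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Node`, `step`, and `keyword_matcher` are hypothetical, and the toy substring matcher only stands in for the LLM that, per the caption, compares the client's last utterance with the reference responses on the outgoing edges.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass
class Node:
    """One workflow step (hypothetical structure, assumed for illustration)."""
    question: str  # what the agent is prompted to ask at this node
    # Maps a reference client response to the id of the next node.
    edges: Dict[str, str] = field(default_factory=dict)


# A matcher picks one reference response for the utterance, or None.
# In the paper this comparison is done by an LLM; here it is a plain callable.
Matcher = Callable[[str, Sequence[str]], Optional[str]]


def step(graph: Dict[str, Node], current: str, client_utterance: str,
         matcher: Matcher) -> Tuple[str, Optional[str]]:
    """Advance one turn; a None prompt means the agent generates freely."""
    node = graph[current]
    choice = matcher(client_utterance, list(node.edges))
    if choice is None:
        # No outgoing edge matched: stay at the same node.
        return current, None
    next_id = node.edges[choice]
    # An edge matched: move on and prompt the agent with the next question.
    return next_id, graph[next_id].question


def keyword_matcher(utterance: str, references: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the LLM comparison: case-insensitive substring match."""
    lowered = utterance.lower()
    for ref in references:
        if ref.lower() in lowered:
            return ref
    return None


# Tiny made-up workflow, used only to exercise the sketch.
graph = {
    "start": Node("Is the device powered on?",
                  {"yes": "check_cable", "no": "power_on"}),
    "check_cable": Node("Is the cable plugged in?"),
    "power_on": Node("Please turn the device on."),
}

matched = step(graph, "start", "Yes, it is.", keyword_matcher)
unmatched = step(graph, "start", "Hmm, maybe?", keyword_matcher)
```

Passing the matcher as a parameter keeps the traversal logic separate from whatever model performs the comparison, so the substring matcher could be swapped for an LLM call without changing `step`.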