Unsupervised End-to-End Task-Oriented Dialogue with LLMs: The Power of the Noisy Channel

Brendan King, Jeffrey Flanigan

TL;DR

We use expectation-maximization (EM) with a noisy channel model to infer turn-level annotations as latent variables, producing an end-to-end dialogue agent that more than doubles the dialogue success rate of a strong GPT-3.5 baseline on MultiWOZ.

Abstract

Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs, e.g., a dialogue state and the system actions taken at each step. These annotations can be costly to produce and error-prone, and they require both domain and annotation expertise. With advances in LLMs, we hypothesize that unlabeled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. We consider a novel unsupervised setting with only (1) a well-defined API schema and (2) a set of unlabeled dialogues between a user and agent. We propose an approach using expectation-maximization (EM) that infers turn-level annotations as latent variables using a noisy channel model to build an end-to-end dialogue agent. Evaluated on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
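The noisy channel idea named in the abstract can be illustrated with a minimal reranking sketch. As the Figure 3 caption describes, candidate DST completions are sampled from a 'direct' prompt and then scored by the likelihood of the user utterance conditioned on each candidate. Everything below is an assumption for illustration only: the function names, the candidate strings, and in particular `toy_channel_logprob`, which is a crude token-overlap stand-in for the LLM likelihood and is not the authors' scoring model.

```python
import math
from typing import Callable, Sequence


def toy_channel_logprob(utterance: str, candidate: str) -> float:
    """Hypothetical stand-in for log p(utterance | candidate).

    A real system would query an LLM with a 'noisy channel' prompt;
    here we only reward candidate tokens that also occur in the utterance.
    """
    utterance_tokens = set(utterance.lower().split())
    for ch in "()=,":
        candidate = candidate.replace(ch, " ")
    candidate_tokens = candidate.lower().split()
    hits = sum(tok in utterance_tokens for tok in candidate_tokens)
    # Add-one smoothing keeps the value finite when there is no overlap.
    return math.log((hits + 1) / (len(candidate_tokens) + 1))


def noisy_channel_rerank(
    candidates: Sequence[str],
    utterance: str,
    channel_logprob: Callable[[str, str], float],
) -> str:
    """Return the candidate that best 'explains' the observed utterance."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda cand: channel_logprob(utterance, cand))


# Made-up candidates, as if sampled from a 'direct' prompt.
candidates = [
    "find_hotel(price=cheap, area=north)",
    "find_restaurant(food=italian)",
    "find_hotel(price=expensive)",
]
best = noisy_channel_rerank(
    candidates, "i need a cheap hotel in the north", toy_channel_logprob
)
print(best)
```

The design point is that the selection criterion is the probability of the observed input given the hypothesis, rather than the probability of the hypothesis given the input.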

Paper Structure (40 sections, 3 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: An overview of our unsupervised dialogue problem. We assume 1) unlabeled goal-oriented dialogues between a user and agent and 2) a well-defined schema $\mathcal{S}$ with APIs suitable for fulfilling goals. We infer the unseen interactions between the agent and API, and use this to produce an end-to-end dialogue agent.
  • Figure 2: An overview of the latent variables annotated in our unsupervised labeling process, which are used to train the dialogue model. Our DST Module (\ref{sec:methods-dst}) infers the API call(s) with arguments at each turn, from which we can derive the dialogue state change. Our DAT or Act Tagging module (\ref{sec:methods-tagging}) predicts the dialogue acts communicated in the observed system response, which can be used to infer de-lexicalized responses for training a response generator.
  • Figure 3: Instances from our 'direct' and 'noisy channel' prompts for DST. Best viewed in color. After sampling a DST completion from the 'direct' prompt, we score it by the likelihood of the input user utterance conditioned on it in the 'noisy channel' prompt.
  • Figure 4: Combined score ($0.5(\text{Inform} + \text{Success}) + \text{BLEU}$) vs. the number of steps of expectation-maximization in our Noisy Channel method vs. a Greedy Ablation. '0' is zero-shot inference.
  • Figure 5: log(Frequency) vs. Rank of dialogue acts used by each model over a 200-dialogue sample of the validation set. 'Natural' refers to human annotations. We find that our Noisy Channel approach uses a higher number of unique dialogue acts than the Greedy approach and better matches the characteristics of the distribution used by human annotators.
  • ...and 3 more figures
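
The combined score in the Figure 4 caption is a simple arithmetic combination of three metrics. A minimal sketch follows; the function name is hypothetical and the input numbers are made up purely to show the arithmetic, not results from the paper.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined score as written in the Figure 4 caption:
    0.5 * (Inform + Success) + BLEU.
    """
    return 0.5 * (inform + success) + bleu


# Illustrative values only (not taken from the paper).
example = combined_score(inform=80.0, success=60.0, bleu=10.0)
print(example)  # 0.5 * (80 + 60) + 10 = 80.0
```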