Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems

Andrea Madotto, Zihan Liu, Zhaojiang Lin, Pascale Fung

TL;DR

The paper tackles data efficiency in task-oriented dialogue by assessing few-shot learning via prompting large language models for four core tasks: NLU, DST, ACT, and NLG. It introduces LM priming with fixed prefixes (binary, value-based, generative) that avoids parameter updates, enabling a single LM to handle multiple tasks. Experimental results show that larger GPT-2 variants improve NLU and NLG performance and can reach or exceed weak fine-tuning baselines in some settings, while DST and ACT show more variable gains and remain challenging. The authors identify two major limitations, context window size and the computational cost of multiple forward passes, and propose future work on longer-context models and end-to-end dialogue benchmarks to extend this approach.

Abstract

Task-oriented dialogue systems use four connected modules, namely Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG). A research challenge is to learn each module with the fewest possible samples (i.e., few-shot), given the high cost of data collection. The most common and effective technique to solve this problem is transfer learning, where large language models, pre-trained on either text or task-specific data, are fine-tuned on the few samples. These methods require fine-tuning steps and a separate set of parameters for each task. In contrast, language models such as GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) allow few-shot learning by priming the model with a few examples. In this paper, we evaluate the few-shot priming ability of language models on the NLU, DST, DP, and NLG tasks. Importantly, we highlight the current limitations of this approach and discuss its possible implications for future work.
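
To make the idea of priming concrete, the following is a minimal sketch of how a few labeled examples and one unlabeled query could be concatenated into a single prefix for a language model. The function name `build_priming_prefix`, the `" => "` separator, the `True`/`False` labels, and the example utterances are all assumptions made for illustration; they are not taken from the paper, and no model is called here.

```python
from typing import List, Tuple


def build_priming_prefix(
    shots: List[Tuple[str, str]], query: str, sep: str = " => "
) -> str:
    """Join labeled examples and an unlabeled query into one text prefix.

    Hypothetical helper: the separator and layout are illustrative choices,
    not the exact prompt format used by the authors.
    """
    # One line per labeled example: "<utterance> => <label>"
    lines = [f"{text}{sep}{label}" for text, label in shots]
    # The query ends with the separator so the model continues with a label.
    lines.append(f"{query}{sep}")
    return "\n".join(lines)


# Made-up examples for a binary "is this a restaurant booking?" intent check.
shots = [
    ("book a table for two tonight", "True"),
    ("what is the weather in Paris", "False"),
]
prefix = build_priming_prefix(shots, "reserve a restaurant for friday")
print(prefix)
```

The resulting string would be passed as the prompt to a pre-trained language model, whose continuation is read as the prediction. No parameters are updated, which is why one model can serve several tasks simply by changing the prefix.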

Paper Structure

This paper contains 18 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Language model priming for few-shot intent recognition. Image inspired by OpenAI GPT-3 (Brown et al., 2020). A few examples are provided, along with the sample to be predicted, as the prefix to the language model.
  • Figure 2:
  • Figure 3: Example of 1-shot LM priming for the ACT task and results on the task. BERT and ToD-BERT are from Wu et al. (2020) and use 500 shots.
  • Figure 4: Example of 1-shot LM priming for the NLG task and results on the task. SC-LSTM, GPT-2, and SC-GPT-2 are from Peng et al. (2020).
  • Figure 5: Example of 1-shot LM priming for the SLOT-FILLING and INTENT tasks and the corresponding results. CT, RZT, and Coach are from Liu et al. (2020) and use 20 shots.
  • ...and 3 more figures