Generating Hard-Negative Out-of-Scope Data with ChatGPT for Intent Classification

Zhijian Li, Stefan Larson, Kevin Leach

TL;DR

Classifiers struggle more to correctly identify hard-negative OOS utterances than general OOS utterances. Incorporating hard-negative OOS data into training improves model robustness when detecting both hard-negative and general OOS data.

Abstract

Intent classifiers must be able to distinguish when a user's utterance does not belong to any supported intent to avoid producing incorrect and unrelated system responses. Although out-of-scope (OOS) detection for intent classifiers has been studied, previous work has not yet studied changes in classifier performance against hard-negative out-of-scope utterances (i.e., inputs that share common features with in-scope data, but are actually out-of-scope). We present an automated technique to generate hard-negative OOS data using ChatGPT. We use our technique to build five new hard-negative OOS datasets, and evaluate each against three benchmark intent classifiers. We show that classifiers struggle to correctly identify hard-negative OOS utterances more than general OOS utterances. Finally, we show that incorporating hard-negative OOS data for training improves model robustness when detecting hard-negative OOS data and general OOS data. Our technique, datasets, and evaluation address an important void in the field, offering a straightforward and inexpensive way to collect hard-negative OOS data and improve intent classifiers' robustness.

Paper Structure (23 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Example exchanges between a user (blue, right side) and a task-driven dialog system for personal finance (grey, left side). The system correctly identifies the user’s utterance as in-scope in ①, and correctly identifies the user's utterance as out-of-scope and gives a valid response in ②. In ③, the system incorrectly identifies the hard-negative OOS user utterance as in-scope and provides an incorrect response.
  • Figure 2: An overview of the hard-negative OOS generation process, including examples. The third generated utterance is filtered out during the two-step OOS verification.
  • Figure 3: Results for Banking77 evaluated with BERT. (a) shows the distribution of softmax confidence scores. (b) shows the distribution of energy confidence scores. (c) shows the F1 score, computed from softmax confidence scores, for hard-negative OOS and general OOS utterances, each paired with in-scope utterances, at different confidence thresholds.
  • Figure 4: Distribution of softmax confidence scores for Clinc-150 evaluated with RoBERTa.
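
The softmax and energy confidence scores named in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. The paper's exact scoring code is not shown in this section, so the sketch assumes the standard definitions: softmax confidence as the maximum softmax probability over in-scope intents, and energy confidence as the negative free energy, T * logsumexp(logits / T). The function names, the logit values, and the 0.7 threshold are hypothetical and chosen only for illustration.

```python
import math

def softmax_confidence(logits):
    """Maximum softmax probability over the in-scope intents (assumed definition)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def energy_confidence(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T) (assumed definition).

    Higher values suggest the input looks more like in-scope data.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))

def is_out_of_scope(score, threshold):
    """Flag an utterance as OOS when its confidence falls below the threshold."""
    return score < threshold

# Hypothetical logits from a 3-intent classifier (illustrative values only).
confident_logits = [6.0, 0.5, -1.0]  # one intent clearly dominates
uncertain_logits = [1.1, 1.0, 0.9]   # no intent stands out

for name, logits in [("confident", confident_logits), ("uncertain", uncertain_logits)]:
    s = softmax_confidence(logits)
    e = energy_confidence(logits)
    print(f"{name}: softmax={s:.3f} energy={e:.3f} oos={is_out_of_scope(s, 0.7)}")
```

Sweeping the threshold over a range of values and computing F1 at each one gives a curve of the kind described for Figure 3(c). A hard-negative OOS utterance is difficult precisely because its logits may resemble the confident case rather than the uncertain one.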