An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, Jason Mars

TL;DR

This paper presents a large crowdsourced dataset for intent classification that explicitly includes out-of-scope queries. It contains 23,700 queries in total, covering 150 in-scope intents across 10 domains and including 1,200 out-of-scope samples. The paper evaluates a range of classifiers, including BERT, across several data variants and three out-of-scope prediction schemes (oos-train, oos-threshold, oos-binary), reporting strong in-scope performance but substantially weaker out-of-scope recall. Increasing the amount of out-of-scope training data improves recall but does not close the gap with in-scope accuracy, underscoring the difficulty of out-of-scope (OOS) detection in realistic, short, user-generated queries. The dataset and evaluations provide a benchmark for developing more robust task-oriented dialog systems that can safely handle queries outside their supported scope.
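
The summary names the oos-threshold scheme without defining it. As a minimal sketch, assuming it means rejecting a query when the classifier's top class probability falls below a cutoff, the decision rule could look like the following. The function name, the "oos" label string, the example intent names, and the 0.7 cutoff are all illustrative assumptions, not values taken from the paper.

```python
from typing import Sequence

# Hypothetical label used here for out-of-scope predictions.
OOS_LABEL = "oos"


def predict_with_threshold(
    probs: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.7,  # assumed cutoff, chosen only for illustration
) -> str:
    """Return the most probable in-scope intent, or OOS_LABEL when the
    top probability is below `threshold`."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    # Index of the highest-probability intent.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else OOS_LABEL


# Hypothetical intents for a personal-finance assistant.
intents = ["balance", "transfer", "pin_change"]
confident = predict_with_threshold([0.10, 0.85, 0.05], intents)
uncertain = predict_with_threshold([0.40, 0.35, 0.25], intents)
```

Under these assumptions, a confident prediction keeps its in-scope label, while a flat probability distribution is routed to the fallback label. In practice the cutoff would need to be tuned on held-out data.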

Abstract

Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope, i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.

Paper Structure

This paper contains 16 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Example exchanges between a user (blue, right side) and a task-driven dialog system for personal finance (grey, left side). The system correctly identifies the user's query in the first exchange, but in the second the user's query is mis-identified as in-scope, and the system gives an unrelated response. In the third, the user's query is correctly identified as out-of-scope and the system gives a fallback response.