Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister

TL;DR

The paper tackles low ASR quality in task-oriented dialogue (TOD) systems by exploiting contextual cues from conversations, including unsuccessful turns. It introduces Contrastive Learning for Conversations (CLC), a family of self-supervised fine-tuning losses with Past-Future and N-best components, designed to extract implicit contextual signals from TODs. The authors validate CLC on a large real-world internal dataset and on OD3, a new semi-synthetic TOD benchmark, reporting up to 6.77% relative WER improvement on the real-world data and up to 19.22% on OD3, with semantic content preserved as indicated by BERT-S metrics. The work also releases OD3 as a public resource to spur further research in context-aware ASR for complex dialogue scenarios, and points to practical gains in user experience for assistants and similar systems.

Abstract

While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute significant numbers of failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions, but these systems usually fail to learn from these interactions, even in an offline fashion. In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new public large-scale semi-synthetic meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines. We make OD3 publicly available at https://github.com/amazon-science/amazon-od3.

Paper Structure

This paper contains 12 sections, 6 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Task-oriented dialogues can contain a multitude of information relevant to automatic speech recognition. In this work, we explore how we can learn from both semantically linked keywords within dialogues and failed dialogue turns.
  • Figure 2: Overview of CLC approaches. The Past-Future loss maximizes agreement between current, past, and future embeddings. The N-best loss minimizes agreement between current embeddings and top predictions of rephrases, while maximizing agreement otherwise.
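
The contrastive idea in the Figure 2 caption can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the paper's formulation: the function names (`cosine`, `info_nce`), the temperature value, the 2-D toy embeddings, and the choice of which embeddings count as positives or negatives are all assumptions made for illustration. Under those assumptions, a Past-Future-style term would treat past and future turns of the same dialogue as positives for the current turn, while an N-best-style term would treat the top hypothesis of a turn that the user later rephrased as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an illustrative stand-in, not the paper's loss).

    The loss is small when the anchor agrees with the positives and
    disagrees with the negatives, and large in the opposite case.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Hypothetical toy embeddings (2-D for readability).
current_turn = [1.0, 0.0]
same_dialogue = [[0.9, 0.1], [0.8, 0.2]]     # e.g. past and future turns
other_dialogues = [[0.0, 1.0], [-1.0, 0.0]]  # e.g. turns from unrelated dialogues

# Past-Future-style use: neighbouring turns are positives.
loss_aligned = info_nce(current_turn, same_dialogue, other_dialogues)

# Swapping the roles shows the loss penalising the wrong pairing.
loss_swapped = info_nce(current_turn, other_dialogues, same_dialogue)

print(f"aligned: {loss_aligned:.3f}  swapped: {loss_swapped:.3f}")
```

An N-best-style term could reuse the same function by passing the embedding of a rephrased turn's top hypothesis in `negatives`. How the paper actually selects, weights, and combines these terms is not specified in the text above.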