Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding

Suyoung Kim, Jiyeon Hwang, Ho-Young Jung

TL;DR

This work tackles the robustness of module-based Spoken Language Understanding to Automatic Speech Recognition (ASR) errors by introducing Contrastive and Consistency Learning (CCL). CCL combines token-based contrastive learning (selective-token and utterance-level) with a consistency objective that ties the noisy ASR input to a cleaner reference through a reference network, all within a neural noisy-channel framework. Across four SLU benchmarks (SLURP, Timers, FSC, SNIPS), CCL yields substantial gains in accuracy and macro-F1, particularly under high WER conditions, and remains competitive or superior to large-language-model baselines in noisy settings. The approach also provides visualizations and ablations that highlight interpretable token-level alignments and latent coherence, supporting practical improvements for robust SLU in real-world, noisy environments, with code released for reproducibility.

Abstract

Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require a large amount of speech data with intent labels, and highly optimized models are generally sensitive to the inconsistency between the training and evaluation conditions. Therefore, a natural language understanding approach based on Automatic Speech Recognition (ASR) remains attractive because it can utilize a pre-trained general language model and adapt to the mismatch of the speech input environment. Using this module-based approach, we improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors. We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. Experiments on four benchmark datasets show that CCL outperforms existing methods and improves the ASR robustness in various noisy environments. Code is available at https://github.com/syoung7388/CCL.
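
The abstract names two objectives, a contrastive term that correlates clean and noisy tokens and a consistency term that keeps the two transcripts' latent features close, but this section does not give their definitions. The following is a minimal sketch of what such a pair of losses could look like. It assumes a generic InfoNCE loss for the contrastive term and a cosine distance for the consistency term. Both are stand-ins rather than the authors' formulation, and the names `info_nce`, `consistency`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed stand-in for the contrastive term).

    anchors[i] is pulled toward candidates[positives[i]] and pushed
    away from every other candidate.
    """
    total = 0.0
    for anchor, pos in zip(anchors, positives):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[pos]
    return total / len(anchors)


def consistency(noisy_latent, clean_latent):
    """Cosine distance between two latent features (assumed stand-in)."""
    return 1.0 - cosine(noisy_latent, clean_latent)


# Toy vectors standing in for encoder outputs (not real model features).
noisy_tokens = [[1.0, 0.0], [0.0, 1.0]]
clean_tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(noisy_tokens, clean_tokens, positives=[0, 1])
misaligned_loss = info_nce(noisy_tokens, clean_tokens, positives=[1, 0])
```

With these toy vectors, `aligned_loss` is close to zero and `misaligned_loss` is large, which is the behavior a contrastive term needs: the loss is small only when each noisy token is most similar to its designated clean counterpart.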

Paper Structure (30 sections, 7 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: An overview of CCL. (a) Given a noisy ASR transcript, the inference network correlates error portions through token-based contrastive learning and then maintains coherence with the latent feature of the clean transcript through consistency learning. (b) During evaluation, the inference network takes only the noisy ASR transcript as input.
  • Figure 2: Selective-token contrastive learning finds a positive pair based on the edit distance for the given clean transcript “Set volume to zero” and noisy ASR transcript “Sat volume to the cero”, and conducts contrastive learning as shown on the right side.
  • Figure 3: t-SNE visualization of clean and noisy ASR transcripts at the word-token and utterance-token levels for the SLURP test set. (a-b) Each color represents one clean transcript and its associated noisy ASR transcripts. (c-d) Each color indicates the tokens of one sample, i.e. the clean token and its noisy ASR tokens.
  • Figure 4: Token-based similarity map visualization results for FSC validation and test sets. X-axis and Y-axis represent clean and noisy ASR transcripts, respectively.
  • Figure 5: Top-5 intent prediction distributions on the SLURP dataset from the reference network, the inference network (trained with CCL), and the baseline model (trained with Noisy-CE).
  • ...and 4 more figures
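
The Figure 2 caption says positive pairs are found from the edit distance between the clean and noisy transcripts. A minimal sketch of one way to do that is a word-level Levenshtein alignment followed by a backtrace that collects matched and substituted positions. The function name `align_tokens`, the lowercasing, and the tie-breaking order are assumptions made for illustration, and the pairs printed here are not claimed to be the ones drawn in the figure.

```python
def align_tokens(clean, noisy):
    """Word-level Levenshtein alignment.

    Returns (distance, pairs), where pairs lists (clean_idx, noisy_idx)
    for every match or substitution on one optimal alignment path.
    """
    n, m = len(clean), len(noisy)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if clean[i - 1] == noisy[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # match or substitution
                dist[i - 1][j] + 1,        # clean word deleted by ASR
                dist[i][j - 1] + 1,        # extra word inserted by ASR
            )

    # Backtrace from the bottom-right corner, preferring the diagonal.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if clean[i - 1] == noisy[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i][j - 1] + 1:
            j -= 1
        else:
            i -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Transcripts taken from the Figure 2 caption, lowercased for comparison.
clean = "set volume to zero".split()
noisy = "sat volume to the cero".split()
distance, pairs = align_tokens(clean, noisy)
print(distance, pairs)  # 3 [(0, 0), (1, 1), (2, 2), (3, 4)]
```

In this run "set" pairs with "sat" and "zero" with "cero", while the inserted "the" has no clean counterpart and is left unpaired. Pairs such as these could then serve as the positives for a token-level contrastive loss.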