Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer

Vishal Sunder, Eric Fosler-Lussier

TL;DR

This work tackles robust end-to-end SLU by jointly modeling ASR and SLU within an RNN-T framework using a self-conditioned CTC objective to softly condition SLU on ASR. It further leverages knowledge transfer from BERT to align acoustic embeddings and introduces a bag-of-entities auxiliary signal to guide decoding, yielding notable SLU gains that approach the performance of large models like Whisper with far fewer parameters. The proposed approach demonstrates significant improvements in slot filling and intent accuracy on SLURP, especially when combining ASR KT pretraining with SLU KT adaptation and BOE conditioning. The results suggest strong practical impact for compact, differentiable SLU systems that remain robust to ASR noise without relying on massive pretraining resources.

Abstract

In this paper, we propose to improve end-to-end (E2E) spoken language understanding (SLU) in an RNN transducer (RNN-T) model by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. Our proposed model is akin to an E2E differentiable cascaded model that performs ASR and SLU sequentially, and CTC self-conditioning ensures that the SLU task is conditioned on the ASR task. This novel joint modeling of ASR and SLU improves SLU performance significantly over SLU-only optimization. We further improve performance by aligning the acoustic embeddings of this model with those of the semantically richer BERT model. Our proposed knowledge transfer strategy applies a bag-of-entity prediction layer to the aligned embeddings, and the output of this layer is used to condition the RNN-T based SLU decoding. These techniques show significant improvement over several strong baselines and perform on par with large models like Whisper while using significantly fewer parameters.
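
The self-conditioning step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used formulation in which an intermediate layer's output is projected to vocabulary posteriors (on which an intermediate CTC loss would be computed), and those posteriors are projected back and added to the hidden states before the next layer. The names `toy_layer`, `self_condition`, `encode`, `w_vocab`, and `w_back`, the element-wise `tanh` stand-in for a conformer layer, and the choice to apply the step after every second layer including the last are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one frame's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(x, w):
    """Multiply a T x A matrix by an A x B matrix (lists of lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def self_condition(hidden, w_vocab, w_back):
    """Add projected CTC posteriors back onto the hidden states.

    hidden:  T x D frame representations from an intermediate layer
    w_vocab: D x V projection to vocabulary logits (assumed shared across layers)
    w_back:  V x D projection from posteriors back to the model dimension
    """
    posteriors = [softmax(r) for r in matmul(hidden, w_vocab)]
    feedback = matmul(posteriors, w_back)
    conditioned = [[h + f for h, f in zip(h_row, f_row)]
                   for h_row, f_row in zip(hidden, feedback)]
    return conditioned, posteriors


def toy_layer(hidden):
    """Hypothetical stand-in for a conformer layer: element-wise tanh."""
    return [[math.tanh(v) for v in row] for row in hidden]


def encode(frames, num_layers, w_vocab, w_back, every=2):
    """Run the layer stack, self-conditioning after every `every`-th layer."""
    hidden = frames
    intermediate_posteriors = []
    for i in range(1, num_layers + 1):
        hidden = toy_layer(hidden)
        if i % every == 0:
            hidden, post = self_condition(hidden, w_vocab, w_back)
            # An intermediate CTC loss against the transcript would use `post`.
            intermediate_posteriors.append(post)
    return hidden, intermediate_posteriors


def random_matrix(rows, cols, rng):
    """Small random matrix used only to exercise the sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    frames = random_matrix(5, 4, rng)    # 5 frames, model dimension 4
    w_vocab = random_matrix(4, 3, rng)   # vocabulary of 3 symbols
    w_back = random_matrix(3, 4, rng)
    out, inter = encode(frames, 4, w_vocab, w_back)
    print(len(out), len(out[0]), len(inter))
```

Because the posteriors are fed back as a soft distribution rather than a decoded transcript, the whole stack stays differentiable, which is what makes the model resemble a cascaded ASR-then-SLU system trained end to end.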

Paper Structure (11 sections, 13 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: A sequence of $L$ conformer layers serves as the transcription network. After every other conformer layer, we have an intermediate CTC loss with self-connections.
  • Figure 2: ASR pretraining with knowledge transfer. The transcription network is the same as in Figure 1. Both RNN-T and SCTC losses are computed against the transcription.
  • Figure 3: SLU adaptation with knowledge transfer. After the ASR pretraining step in Figure 2, the model is adapted for SLU where the utterance level representation from the Attention block is utilized for predicting the bag-of-entities and is added to the joint network as in equation.
  • Figure 4: Alignment at different levels for a sample utterance "how cold is it outside today". Top: the RNN-T alignment between the SLU tags and the speech. Middle: character-level alignment at the last SCTC layer. Bottom: subword-level alignment from the attention layer during KT pretraining.
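
To make the bag-of-entities signal from Figure 3 concrete, the sketch below builds a multi-hot target and scores a prediction against it. It rests on assumptions not stated in the captions: that the bag is taken over slot (entity) types rather than slot values, and that the prediction layer is trained with a binary cross-entropy loss. The slot inventory and the annotation attached to the sample utterance are hypothetical, as are the function names `bag_of_entities` and `binary_cross_entropy`.

```python
import math


def bag_of_entities(slots, inventory):
    """Multi-hot vector marking which entity types occur in an utterance.

    slots:     list of (entity_type, value) pairs from the SLU annotation
    inventory: ordered list of all entity types (hypothetical here)
    """
    present = {entity_type for entity_type, _ in slots}
    return [1.0 if name in present else 0.0 for name in inventory]


def binary_cross_entropy(probs, target, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and a multi-hot target."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(target)


if __name__ == "__main__":
    # Hypothetical inventory and annotation for "how cold is it outside today".
    inventory = ["date", "place_name", "weather_descriptor"]
    slots = [("weather_descriptor", "cold"), ("date", "today")]
    target = bag_of_entities(slots, inventory)
    print(target, binary_cross_entropy([0.9, 0.1, 0.9], target))
```

An utterance-level target of this kind ignores entity order and position, so it can be predicted from a single pooled representation before decoding begins.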