Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax

Aditya Patil, Vikas Joshi, Purvi Agrawal, Rupesh Mehta

TL;DR

This work addresses bilingual and code-mixed ASR without explicit language input by proposing a streaming, on-device bilingual E2E model. It extends the MultiSoftmax framework with a self-attention mechanism that weights language-specific posteriors to produce a single posterior over a combined symbol set, enabling a single beam search and dynamic language switching. Relative to the conventional bilingual baseline, the approach reduces WER by 13.3%, 8.23%, and 1.3% on the Hindi, English, and code-mixed test sets, respectively, with attention particularly improving code-mixed performance. Analysis of the attention weights demonstrates the model’s implicit language identification capability. The proposed method offers a practical pathway for multilingual on-device ASR with low latency and reduced memory footprint.

Abstract

Despite several advances in multilingual modeling, recognizing multiple languages with a single neural model remains challenging when the input language is unknown, and most multilingual models assume that the input language is available. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach in which a single neural model can recognize both languages and also support switching between them, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. Because the language-specific posteriors are combined, the model produces a single posterior probability over all output symbols, enabling a single beam search decoding and allowing dynamic switching between the languages. The proposed approach outperforms the conventional bilingual baseline with relative word error rate reductions of 13.3%, 8.23% and 1.3% on the Hindi, English and code-mixed test sets, respectively.
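
To make the idea of combining language-specific posteriors concrete, here is a minimal NumPy sketch for a single decoding step. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the attention scores are computed, so they are passed in as a hypothetical input (`attn_scores`). The function name `combine_posteriors`, the vocabulary sizes, and the random logits are also made up for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))
    return e / e.sum()


def combine_posteriors(logits_per_lang, attn_scores):
    """Weight per-language posteriors and concatenate them.

    logits_per_lang: list of 1-D arrays, one per language-specific
        joint network (hypothetical stand-ins for its outputs).
    attn_scores: one unnormalized score per language (assumed input;
        how the real model derives them is not given in the abstract).
    Returns (combined posterior over all symbols, attention weights).
    """
    weights = softmax(attn_scores)                     # sums to 1 across languages
    posteriors = [softmax(l) for l in logits_per_lang]  # each sums to 1
    # Scaling each posterior by its weight keeps the concatenation a
    # valid distribution over the combined symbol set.
    combined = np.concatenate([w * p for w, p in zip(weights, posteriors)])
    return combined, weights


rng = np.random.default_rng(0)
hindi_logits = rng.normal(size=5)    # hypothetical Hindi symbol set of size 5
english_logits = rng.normal(size=4)  # hypothetical English symbol set of size 4

combined, weights = combine_posteriors(
    [hindi_logits, english_logits], attn_scores=[2.0, 0.5]
)
print(combined.shape, combined.sum(), weights)
```

Because the attention weights sum to one and each language-specific posterior sums to one, the concatenated vector is itself a probability distribution. A single beam search can therefore score Hindi and English symbols side by side, and the weights can shift from one step to the next, which is what permits switching languages within an utterance.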

Paper Structure (12 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Block schematic of (1) Vanilla bilingual model, (2) Bilingual MultiSoftmax model, and (3) Bilingual MultiSoftmax with Attention model.
  • Figure 2: Attention weights over time frames for (a) English, and (b) Hindi utterance, respectively.
  • Figure 3: Attention weights over time frames for code-mixed utterances.
  • Figure 4: Probability density functions (pdfs) of the attention weights for English, Hindi and code-mixed utterances, respectively. The weight values on the x-axis extend beyond $0$ and $1.0$ because the entire multi-modal Gaussian distribution estimated over the weight values is plotted. The y-axis shows the value of the pdf.