A Non-autoregressive Model for Joint STT and TTS

Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras

TL;DR

This work introduces a fully non-autoregressive multimodal model that jointly performs STT and TTS while supporting training from paired and unpaired data. It combines a duration-based alignment module, masking-based self-supervision, and a multimodal encoder with task-specific heads, enabling input from text, speech, or both. An iterative refinement procedure feeds partial predictions back into the model to progressively improve outputs, with STT refinements guided by confidence and TTS refinements by progressive unmasking over $K$ steps. The approach demonstrates that joint training with unpaired data and iterative refinement can match or exceed task-specific baselines on STT and closely approach TTS performance, with notable gains in STT accuracy and speaker fidelity on standard benchmarks. This indicates a promising direction for efficient, scalable multimodal speech processing without autoregressive decoding.

Abstract

In this paper, we take a step towards jointly modeling speech-to-text (STT) and text-to-speech (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. Owing to its multimodal nature, the proposed model can also be trained with unpaired speech or text data. We further propose an iterative refinement strategy in which the partial hypothesis at the output is fed back to the input of the model, iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Model overview. The dashed arrows from the output back to the input represent the feedback path used during iterative refinement (see the paper's subsection on iterative refinement).
  • Figure 2: Iterative refinement illustration. For TTS (left), we show 4 iterations in which the log-mel features are gradually unmasked. For STT (right), the word "CAT" is predicted and refined over 3 iterations. "[m]" denotes the <mask> token.
  • Figure 3: Effect of number of iterations of iterative refinement on STT (left) and TTS (right). These results are on the development sets.
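
The refinement loop illustrated in Figure 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear unmasking schedule, the rule of re-masking the least confident positions (borrowed from mask-predict style decoding), and the names unmask_schedule, refine_stt, and toy_predict are all hypothetical. The toy predictor merely stands in for the multimodal model so that the loop can run end to end.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "[m]"  # mask symbol, as in the Figure 2 caption

# A predictor maps a partially masked sequence to (token, confidence) pairs.
Predictor = Callable[[Sequence[str]], List[Tuple[str, float]]]


def unmask_schedule(num_frames: int, num_steps: int) -> List[int]:
    """Frames still masked after each step (ASSUMED linear schedule for TTS)."""
    return [num_frames - (num_frames * k) // num_steps
            for k in range(1, num_steps + 1)]


def refine_stt(predict: Predictor, length: int, num_steps: int) -> List[str]:
    """Confidence-guided refinement: predict, re-mask weak positions, repeat."""
    tokens = [MASK] * length
    for step in range(1, num_steps + 1):
        preds = predict(tokens)
        tokens = [tok for tok, _ in preds]
        if step == num_steps:
            break
        # ASSUMPTION: re-mask the least confident positions, fewer each step.
        num_to_mask = length * (num_steps - step) // num_steps
        worst = sorted(range(length), key=lambda i: preds[i][1])[:num_to_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


def toy_predict(tokens: Sequence[str]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for the model; it 'knows' the target is CAT."""
    target = "CAT"
    visible = sum(t != MASK for t in tokens)
    out = []
    for i, ch in enumerate(target):
        if tokens[i] != MASK:
            out.append((tokens[i], 1.0))     # keep tokens that were fed back
        elif i <= visible:
            out.append((ch, 0.9 - 0.1 * i))  # confident guess given context
        else:
            out.append(("?", 0.1))           # low-confidence placeholder
    return out


if __name__ == "__main__":
    print(unmask_schedule(8, 4))
    print(refine_stt(toy_predict, length=3, num_steps=3))
```

With the toy predictor, each pass commits one more letter of "CAT", mirroring the three STT iterations in the figure. In a real system the predictor would be the trained model, and the schedule and re-masking rule would follow whatever the paper specifies.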