A Non-autoregressive Model for Joint STT and TTS
Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras
TL;DR
This work introduces a fully non-autoregressive multimodal model that jointly performs STT and TTS and can be trained on both paired and unpaired data. It combines a duration-based alignment module, masking-based self-supervision, and a multimodal encoder with task-specific heads, so the input can be text, speech, or both. An iterative refinement procedure feeds partial predictions back into the model to progressively improve the outputs: STT refinement is guided by prediction confidence, and TTS refinement proceeds by progressive unmasking over $K$ steps. With joint training on unpaired data and iterative refinement, the model matches or exceeds the STT-specific baseline and comes close to the TTS-specific baseline, with notable gains in STT accuracy and speaker fidelity on standard benchmarks. This points to a promising direction for efficient, scalable multimodal speech processing without autoregressive decoding.
Abstract
In this paper, we take a step towards jointly modeling speech-to-text (STT) and text-to-speech (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework that accepts the speech and text modalities as input either individually or together. Owing to its multimodal nature, the proposed model can also be trained with unpaired speech or text data. We further propose an iterative refinement strategy in which the partial hypothesis at the output is fed back to the input of the model, iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
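
The iterative refinement idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `MASK` placeholder, the linear unmasking schedule, and the toy predictors are all assumptions made for illustration. The first loop re-predicts every position at each step and keeps only the most confident tokens (a confidence-guided scheme in the spirit of the STT refinement). The second reveals a growing prefix of frames over $K$ steps, which is one possible reading of progressive unmasking for TTS.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical placeholder for a masked position


def refine_by_confidence(
    predict: Callable[[List[Optional[str]]], List[Tuple[str, float]]],
    length: int,
    num_steps: int,
) -> List[Optional[str]]:
    """Confidence-guided refinement (assumed linear schedule).

    Each step re-predicts all positions in parallel, keeps the most
    confident ones, and re-masks the rest before feeding them back.
    """
    hypothesis: List[Optional[str]] = [MASK] * length
    for step in range(1, num_steps + 1):
        preds = predict(hypothesis)  # (token, confidence) per position
        keep = math.ceil(length * step / num_steps)
        ranked = sorted(range(length), key=lambda i: preds[i][1], reverse=True)
        kept = set(ranked[:keep])
        hypothesis = [preds[i][0] if i in kept else MASK for i in range(length)]
    return hypothesis


def refine_by_progressive_unmasking(
    predict: Callable[[List[Optional[float]]], List[float]],
    length: int,
    num_steps: int,
) -> List[Optional[float]]:
    """Progressive unmasking over K = num_steps steps.

    Assumption: frames are revealed as a growing prefix, and a frame
    is frozen once it has been revealed.
    """
    frames: List[Optional[float]] = [MASK] * length
    for step in range(1, num_steps + 1):
        preds = predict(frames)
        reveal = math.ceil(length * step / num_steps)
        frames = [
            frames[i] if frames[i] is not MASK
            else (preds[i] if i < reveal else MASK)
            for i in range(length)
        ]
    return frames


# Toy stand-ins for the model (purely hypothetical).
TARGET = list("hello")


def toy_stt_predict(hypothesis: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Predicts a token correctly only if it is first or has an unmasked neighbour."""
    preds = []
    for i, token in enumerate(TARGET):
        has_context = i == 0 or any(
            0 <= j < len(hypothesis) and hypothesis[j] is not MASK
            for j in (i - 1, i + 1)
        )
        preds.append((token, 0.9) if has_context else ("?", 0.1))
    return preds


def toy_tts_predict(frames: List[Optional[float]]) -> List[float]:
    """Returns a dummy frame value per position."""
    return [float(i) for i in range(len(frames))]


if __name__ == "__main__":
    print(refine_by_confidence(toy_stt_predict, 5, 1))  # single pass
    print(refine_by_confidence(toy_stt_predict, 5, 3))  # three refinement steps
    print(refine_by_progressive_unmasking(toy_tts_predict, 4, 2))
```

With the toy predictor, a single pass leaves most positions wrong because no context is available yet, while three steps recover the full target. That is the intuition behind feeding partial hypotheses back into the model.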
