Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly
TL;DR
Visatronic presents a unified decoder-only transformer that ingests video, text, and speech as temporally aligned tokens to perform Video-Text to Speech (VTTS) generation. It discretizes each modality (VQ-VAE video tokens, character text tokens, and dMel speech tokens) and trains a single autoregressive model to predict speech tokens from multimodal inputs, using RoPE-based positional encodings and robust input mixing. The work introduces TimeSync, a phoneme-level synchronization metric, and demonstrates strong zero-shot generalization from VoxCeleb2 to LRS3, achieving 12.2% WER on VoxCeleb2 and 4.5% WER on LRS3, surpassing prior baselines. Ablations show that both video and text conditioning are important and that simple video aggregation suffices. These results support the viability of end-to-end multimodal decoding for temporally coherent speech synthesis, with potential applications in dubbing and expressive speech generation.
Abstract
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvements in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus restrict the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, providing a direct measure of synchronization quality. Demo: https://apple.github.io/visatronic-demo/
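
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are illustrative assumptions; the evaluation pipeline behind the reported numbers (ASR model, text normalization) is not described in this section and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: transcripts are normalized by lowercasing and whitespace
    splitting only; real evaluations usually apply stricter normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 4.5% WER corresponds to roughly 4.5 word-level errors (substitutions, deletions, or insertions) per 100 reference words. Note that WER can exceed 100% when the hypothesis contains many inserted words.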
