Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

TL;DR

Visatronic presents a unified decoder-only transformer that ingests video, text, and speech as temporally aligned tokens to perform Video-Text to Speech (VTTS) generation. It discretizes each modality (VQ-VAE video tokens, character text tokens, and dMel speech tokens) and trains a single autoregressive model to predict speech tokens from multimodal inputs, using RoPE-based positional encodings and robust input mixing. The work introduces TimeSync, a phoneme-level synchronization metric, and demonstrates strong zero-shot generalization from VoxCeleb2 to LRS3, achieving 12.2% WER on VoxCeleb2 and 4.5% WER on LRS3, surpassing prior baselines. Ablations show the importance of both video and text conditioning and reveal that simple video aggregation suffices, underscoring the viability of end-to-end multimodal decoding for temporally coherent speech synthesis with potential applications in dubbing and expressive speech generation.

Abstract

The rapid progress of foundation models and large language models (LLMs) has fueled significant improvement in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality. Demo: https://apple.github.io/visatronic-demo/

Paper Structure

This paper contains 38 sections, 1 equation, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Visatronic overview. In addition to the existing text-to-speech (leftmost) and lips-to-speech (middle) tasks, we address a multimodal generative task (rightmost), video-text to speech (VTTS), where the model is conditioned on the video of talking people and the corresponding text transcriptions in order to generate speech. Visatronic is a unified decoder-only transformer that processes video $\mathbf{v}$ (grey), text $\mathbf{t}$ (grey), and speech $\mathbf{s}$ (blue) as discrete tokens in a shared sequence. The model is trained with cross-entropy loss $\mathcal{L}_{CE}$ to predict speech tokens, learning cross-modal interactions and temporal alignment.
  • Figure 2: Video representation. Each video frame at time $t$ is encoded via a VQ-VAE (Yan et al., 2021) into a downsampled spatial grid in $\mathbb{R}^{H'\times W'\times D}$. Each vector at location $(h, w)$ is quantized to a discrete token using the learned codebook $\mathbf{C}^v$ via $l_2$ similarity. These discrete tokens are embedded into $\mathbb{R}^{D'}$ and aggregated across the spatial grid to produce the final frame-level embedding input to the transformer. See the tokenization section for details.
  • Figure 3: Speech representation. We follow the speech discretization process from dMel (Bai et al., 2024): each continuous log mel-filterbank value at time $t$, extracted from the raw audio, is mapped to a discrete value using a codebook of evenly spaced values. Each discretized log mel-filterbank at time $t$ is then mapped through a learnable embedding layer; the representations for all log mel-filterbanks at time $t$ are stacked together, and the resulting vector is projected by a learnable linear layer to the model dimension $D'$. All discretized log mel-filterbanks at time $t$ are predicted in parallel and independently.
  • Figure 4: Input sequence for Visatronic. We encode all modalities into a discrete token space (see Figures 2 and 3), which is directly consumed by the decoder-only transformer. Each modality’s discrete representation is indicated by a colored square. Each row illustrates a different strategy for combining multimodal information to learn temporal alignment across modalities: (top) text precedes video, which is followed by speech; (middle) text appears first, while speech and video are interleaved in temporal order such that speech generation at time $t$ attends to the full text and only past video tokens at $t' < t$; (bottom) video precedes text, which is followed by speech. The position sequence either uses global indexing across all modalities or aligns video and speech tokens by timestamp.
  • Figure 5: TimeSync. Visualization of the phoneme-level alignment used for computing TimeSync. Left: alignment in ground truth audio before (blue) and after (green) removing silence ("sp") segments. Right: aligned phoneme positions between ground truth (green) and generated (red) audio, where TimeSync is computed as the absolute difference between segment centers (measured in seconds).
  • ...and 11 more figures
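
The video tokenization described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy codebook, and the toy embedding table are hypothetical, and mean pooling is only one assumed way to aggregate across the spatial grid, since the caption does not name the aggregation.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` in l2 distance."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def quantize_frame(frame, codebook):
    """Map an H' x W' grid of D-dim vectors to an H' x W' grid of token ids."""
    return [[nearest_code(vec, codebook) for vec in row] for row in frame]

def frame_embedding(token_grid, embedding_table):
    """Embed each token, then mean-pool over the grid (assumed aggregation)."""
    tokens = [tok for row in token_grid for tok in row]
    dim = len(embedding_table[0])
    return [
        sum(embedding_table[tok][d] for tok in tokens) / len(tokens)
        for d in range(dim)
    ]

# Toy example: a 1 x 2 grid of 2-dim encoder outputs and a 2-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frame = [[[0.1, 0.0], [0.9, 1.2]]]
token_grid = quantize_frame(frame, codebook)

# Hypothetical embedding table mapping each token id to a D'-dim vector.
embedding_table = [[1.0, 0.0], [0.0, 1.0]]
pooled = frame_embedding(token_grid, embedding_table)
```

In a trained model, the codebook comes from the VQ-VAE and the embedding table is learned together with the transformer; the toy values here only make the lookup and pooling steps concrete.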
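
The speech discretization in the Figure 3 caption maps each log mel-filterbank value to the nearest entry of a codebook of evenly spaced values. The sketch below shows that idea under stated assumptions: the value range, the number of levels, and the function names are hypothetical and are not taken from the paper or from dMel.

```python
def make_codebook(low, high, num_levels):
    """Evenly spaced levels covering [low, high]."""
    step = (high - low) / (num_levels - 1)
    return [low + i * step for i in range(num_levels)]

def discretize(frame, low, high, num_levels):
    """Map each filterbank value in one frame to the index of its nearest level."""
    step = (high - low) / (num_levels - 1)
    tokens = []
    for value in frame:
        clipped = min(max(value, low), high)  # out-of-range values saturate
        tokens.append(round((clipped - low) / step))
    return tokens

def reconstruct(tokens, codebook):
    """Look up the level for each token (lossy inverse of `discretize`)."""
    return [codebook[tok] for tok in tokens]

# Toy frame with five filterbank channels; range and level count are assumptions.
LOW, HIGH, NUM_LEVELS = 0.0, 3.0, 4
frame = [0.2, 1.7, 2.9, 5.0, -1.0]
tokens = discretize(frame, LOW, HIGH, NUM_LEVELS)
approx = reconstruct(tokens, make_codebook(LOW, HIGH, NUM_LEVELS))
```

Each channel yields one token per frame, which matches the caption's description of all filterbank channels at time $t$ being embedded separately, stacked, and predicted in parallel.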
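
The Figure 5 caption defines TimeSync as the absolute difference between the centers of aligned phoneme segments, measured in seconds. A minimal sketch of that computation follows. The segment format, the function names, the one-to-one pairing of phonemes after dropping silence, and the averaging over phonemes are assumptions made for illustration; the caption does not specify how per-phoneme offsets are combined or whether timestamps are shifted after silence removal.

```python
def drop_silence(segments, silence_label="sp"):
    """Remove silence segments; each segment is (phoneme, start_sec, end_sec)."""
    return [seg for seg in segments if seg[0] != silence_label]

def center(segment):
    """Midpoint of a segment in seconds."""
    _, start, end = segment
    return (start + end) / 2.0

def timesync(reference, generated):
    """Mean absolute offset between paired segment centers (assumed averaging)."""
    ref = drop_silence(reference)
    gen = drop_silence(generated)
    if not ref or len(ref) != len(gen):
        raise ValueError("expected equal, non-empty phoneme sequences")
    offsets = [abs(center(r) - center(g)) for r, g in zip(ref, gen)]
    return sum(offsets) / len(offsets)

# Toy forced-alignment output for the same two phonemes in both recordings.
reference = [("sp", 0.0, 0.0), ("HH", 0.0, 1.0), ("AY", 1.0, 2.0)]
generated = [("HH", 0.5, 1.5), ("sp", 1.5, 2.0), ("AY", 2.0, 3.0)]
score = timesync(reference, generated)
```

In this toy case the two phoneme centers are offset by 0.5 s and 1.0 s, so a lower score indicates generated speech that is more tightly synchronized with the reference.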