TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, Alan Cowen

TL;DR

A novel tokenization scheme is proposed that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM, and achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.

Abstract

Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle the speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
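
To make the idea of text-only guidance concrete, the following is a minimal sketch of blending next-token logits from the two modes. The abstract only states that the logits are blended; the linear interpolation, the weight name `alpha`, and the function names `blend_logits` and `softmax` are assumptions made here for illustration, not the paper's stated formulation.

```python
import math


def blend_logits(text_only_logits, text_speech_logits, alpha):
    """Blend two logit vectors over the same vocabulary.

    ASSUMPTION: a plain linear interpolation. alpha = 0.0 keeps the
    text-speech logits unchanged; alpha = 1.0 uses the text-only logits.
    """
    if len(text_only_logits) != len(text_speech_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(text_only_logits, text_speech_logits)
    ]


def softmax(logits):
    """Numerically stable softmax, used to turn blended logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 3-token vocabulary.
text_only = [2.0, 0.0, -1.0]
text_speech = [0.0, 2.0, -1.0]

blended = blend_logits(text_only, text_speech, alpha=0.5)
probs = softmax(blended)
```

In this sketch, `alpha` acts as a dial between the two modes: larger values pull the next-token distribution toward what the model would predict without any speech in the context.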

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Word-level alignment via Viterbi decoding. An illustration of the forced alignment between a speech waveform and the transcript "That's exactly what happened...". The Viterbi algorithm identifies the most probable frame-level assignment to determine a position $p_i$ for each token $w_i$ in the encoded text sequence.
  • Figure 2: Operating under a Variational Autoencoder (VAE) framework, our model utilizes a symmetric encoder-decoder architecture. Each module integrates a CNN-based component for local acoustic feature extraction and reconstruction, complemented by a transformer-based backbone designed to capture the dynamic temporal range of synchronized speech-text sequences.
  • Figure 3: Attention mask of the encoder (left) and the decoder (right). Asterisks ($*$) mark text-assigned temporal indices. In the encoder, non-text-assigned positions are restricted to intra-block attention, excluding boundary tokens; conversely, text-assigned positions are permitted to attend across both preceding and succeeding blocks. The decoder also utilizes a localized mechanism where each position attends to the current and immediately preceding blocks.
  • Figure 4: Each text token $w_i$ is paired with the speech representation at the $K$-shifted position, comprising the token features $s_{i-K}$, the number of preceding frames $f_{i-K}^{\text{before}}$, and the number of succeeding frames $f_{i-K}^{\text{after}}$; the pair is then processed by an autoregressive decoder. The decoder predicts the next text token and produces a conditioning vector, which the flow matching head uses to generate the next speech representation $\left(s_{i-K+1},f_{i-K+1}^{\text{before}},f_{i-K+1}^{\text{after}}\right)$.
  • Figure 5: Reconstruction quality (CER, SS, and oMOS) vs. token density for uniformly spaced tokens.
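
The $K$-shifted pairing described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption only: the function name `pair_with_shift`, the `pad` placeholder for the first $K$ positions, and the choice to leave the last $K$ speech representations unpaired are assumptions, since the caption does not say how the sequence boundaries are handled.

```python
def pair_with_shift(text_tokens, speech_reps, k, pad=None):
    """Pair each text token w_i with the speech representation at index i - k.

    Each speech representation is expected to be a tuple
    (s, f_before, f_after): token features plus the frame counts
    before and after the token.

    ASSUMPTION: positions where i - k falls outside the speech sequence
    receive `pad`; the last k speech representations are not paired here.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    pairs = []
    for i, w in enumerate(text_tokens):
        j = i - k
        rep = speech_reps[j] if 0 <= j < len(speech_reps) else pad
        pairs.append((w, rep))
    return pairs


# Hypothetical inputs: three text tokens, each with (s, f_before, f_after).
tokens = ["That", "'s", "exactly"]
reps = [("s0", 2, 3), ("s1", 1, 4), ("s2", 0, 5)]

shifted = pair_with_shift(tokens, reps, k=1)
# shifted[0] pairs "That" with the pad value,
# shifted[1] pairs "'s" with the representation of "That", and so on.
```

With `k=0` the pairing is strictly aligned; a positive shift means the text token at each step arrives $K$ positions before its speech representation.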