Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis

TL;DR

MAVE introduces a hybrid autoregressive framework that combines a linear-time Mamba state-space decoder with cross-attention to phoneme-conditioned text, enabling high-fidelity speech editing and zero-shot TTS. By using input token rearrangement, cross-modal conditioning, and reference-context speaker cues, it achieves state-of-the-art performance on RealEdit editing benchmarks and superior zero-shot TTS metrics relative to VoiceCraft and FluentSpeech, while reducing memory usage by about sixfold. The approach demonstrates competitive or superior perceptual quality with single-pass generation and scalable long-form generation, addressing the core trade-offs between fidelity, efficiency, and context. This work suggests a new direction for scalable, high-quality speech generation that jointly handles editing and synthesis through structured state-space modeling and differentiable cross-attention.

Abstract

We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, despite not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred MAVE, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech in both pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.
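To make the pairing of a linear-time recurrent backbone with cross-attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a fixed-parameter diagonal state-space recurrence as a stand-in for a Mamba layer (real Mamba uses input-dependent parameters and a fused scan kernel) and a single-head cross-attention in which audio positions query the text embeddings. All names, shapes, and parameter values here (`diagonal_ssm_scan`, `cross_attention`, `d = 8`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diagonal_ssm_scan(x, decay, in_gain):
    """Causal recurrence h_t = decay * h_{t-1} + in_gain * x_t.

    Assumption: a simplified, non-selective stand-in for a Mamba layer.
    Cost grows linearly with sequence length and the state has fixed size.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + in_gain * x[t]
        out[t] = h
    return out

def cross_attention(audio_states, text_emb, w_q, w_k, w_v):
    """Single-head cross-attention: each audio position attends over the text."""
    q = audio_states @ w_q            # queries come from the audio stream
    k = text_emb @ w_k                # keys come from the text encoder output
    v = text_emb @ w_v                # values come from the text encoder output
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection keeps the audio stream and adds text information.
    return audio_states + weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                  # hypothetical model width
audio = rng.normal(size=(12, d))       # 12 audio-token embeddings
text = rng.normal(size=(5, d))         # 5 phoneme embeddings
decay = np.full(d, 0.9)
in_gain = np.full(d, 0.1)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

states = diagonal_ssm_scan(audio, decay, in_gain)
fused, attn = cross_attention(states, text, w_q, w_k, w_v)
```

In this sketch the recurrent state has a fixed size regardless of how many audio tokens have been processed, which is the general property that lets state-space decoders avoid a growing key-value cache over the audio sequence; the attention matrix is only audio-length by text-length.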

Paper Structure

This paper contains 31 sections, 12 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the proposed MAVE architecture. The model accepts phonemized text and audio tokens as input. A causal masking and rearrangement strategy is applied to the audio tokens to enable bidirectional context for editing. The core of the model is a Mamba block for efficient sequence modeling, augmented with cross-attention layers to condition the audio generation on the text embeddings produced by a Transformer encoder.
  • Figure 2: Side-by-side comparison between MAVE (ours), VoiceCraft, and FluentSpeech.
  • Figure 3: Instructions given to raters for assessing the quality of audio samples in terms of naturalness and intelligibility.
  • Figure 4: Example of the questions raters were asked when assessing naturalness for the speech editing task.
  • Figure 5: Example question from the pairwise comparative study between our model, VoiceCraft, and FluentSpeech.
  • ...and 2 more figures
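
The Figure 1 caption mentions a rearrangement of the audio tokens that gives an autoregressive model bidirectional context for editing. The caption does not spell out the exact scheme, so the sketch below is an assumption: it shows one common form of the idea, in which each span to be edited is replaced by a mask token and moved to the end of the sequence, so that by the time the model generates the span it has already seen both the left and the right context. The function name `rearrange_for_editing` and the `<M0>`-style mask tokens are hypothetical.

```python
from typing import List, Sequence, Tuple

def rearrange_for_editing(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
) -> List[str]:
    """Move each edit span to the end of the sequence.

    `spans` holds sorted, non-overlapping half-open (start, end) index pairs.
    Assumption: mask tokens are written as "<M0>", "<M1>", ... for illustration.
    """
    kept: List[str] = []    # unedited context with mask placeholders
    moved: List[str] = []   # masked spans, each introduced by its mask token
    cursor = 0
    for i, (start, end) in enumerate(spans):
        if not (cursor <= start < end <= len(tokens)):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        mask = f"<M{i}>"
        kept.extend(tokens[cursor:start])
        kept.append(mask)               # marks where the span used to be
        moved.append(mask)              # marks which span follows
        moved.extend(tokens[start:end])
        cursor = end
    kept.extend(tokens[cursor:])
    return kept + moved

example = rearrange_for_editing(["a0", "a1", "a2", "a3", "a4", "a5"], [(2, 4)])
print(example)
# ['a0', 'a1', '<M0>', 'a4', 'a5', '<M0>', 'a2', 'a3']
```

With this ordering, a strictly left-to-right decoder generating `a2` and `a3` conditions on `a4` and `a5` as well as on `a0` and `a1`, which is what makes the edit consistent with the audio on both sides.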