MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper

TL;DR

MARS6 is a compact 70M-parameter TTS model based on spoken language modelling (SLM), with a hierarchical decoder and a multi-rate SNAC-based acoustic codec. It produces robust, expressive speech with fast inference, processing new speech tokens at only $12$ Hz. By integrating techniques such as ORPO, RIO, flux loss, Repetition Aware Sampling, quality prefixing, and top-$p$ backoff, the approach stabilizes outputs and improves speaker cloning while remaining competitive with much larger models in objective and subjective quality. The work demonstrates that a carefully curated set of training-time and inference-time tricks, together with a hierarchical decoding strategy, can yield strong performance on challenging in-the-wild data (EARS) without requiring heavy resources or phoneme alignments. This matters in practice for expressive TTS and voice cloning applications where model size, speed, and robustness are critical.

Abstract

Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/

Paper Structure (27 sections, 1 equation, 2 figures, 1 table)

Figures (2)

  • Figure 1: MARS6 is an encoder-decoder transformer. The encoder converts a speaker embedding and a sequence of text embeddings to latent vectors for cross-attention in the global decoder. The hierarchical autoregressive decoder has two parts: the global decoder produces new latent vectors at a low sample rate, and a smaller local decoder autoregressively decodes each vector into acoustic tokens. The entire patch of acoustic tokens then forms the next input vector to the global decoder through a patch embedding.
  • Figure 2: Comparison of word error rates for different speaker reference styles.
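
The decoding loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: `global_step`, `local_step`, `patch_size`, and `global_steps_for` are hypothetical names, the two step functions are toy stand-ins for the transformer decoders, and the encoder, speaker embedding, and cross-attention are omitted. The only figure taken from the text is the 12 Hz rate at which the global decoder runs.

```python
from typing import Callable, List

# From the abstract: new speech tokens are processed at 12 Hz by the global decoder.
GLOBAL_RATE_HZ = 12


def global_steps_for(duration_s: float) -> int:
    """Number of global decoder steps needed to cover `duration_s` seconds of audio."""
    return round(duration_s * GLOBAL_RATE_HZ)


def hierarchical_decode(
    global_step: Callable[[List[List[int]]], int],
    local_step: Callable[[int, List[int]], int],
    n_global_steps: int,
    patch_size: int,  # hypothetical: acoustic tokens per patch, not stated in this summary
) -> List[List[int]]:
    """Outer loop: one latent per global step. Inner loop: one patch of tokens per latent."""
    patches: List[List[int]] = []
    for _ in range(n_global_steps):
        # Global decoder: conditions on all previous patches (via a patch
        # embedding in the real model) and emits one latent.
        latent = global_step(patches)
        # Local decoder: autoregressively expands the latent into a patch.
        patch: List[int] = []
        for _ in range(patch_size):
            patch.append(local_step(latent, patch))
        # The finished patch becomes part of the next global input.
        patches.append(patch)
    return patches


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_global(patches: List[List[int]]) -> int:
    return len(patches)


def toy_local(latent: int, patch: List[int]) -> int:
    return latent * 10 + len(patch)


demo = hierarchical_decode(toy_global, toy_local, n_global_steps=2, patch_size=3)
```

The point of the split is that the expensive global model runs only `global_steps_for(duration)` times, for example 120 steps for 10 seconds of audio, while the cheaper local model handles the per-token work inside each patch.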