Table of Contents
Fetching ...

Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward

Guansu Wang, Peijie Sun

TL;DR

This work tackles the mismatch between coarse utterance-level evaluation and fine-grained speech errors in autoregressive TTS. It introduces W3AR, a framework that leverages cross-attention from a pre-trained ASR model to produce word-level rewards via Attention Purity and Alignment Monotonicity, guiding a stable group-relative policy optimization. Empirical results show significant improvements across state-of-the-art TTS models and strong generalization to unseen speakers, including out-of-domain datasets, with both objective and subjective gains. The findings advocate a broader paradigm where an expert model serves as an evaluative supervisor to provide informative, fine-grained feedback for generative systems. Overall, W3AR demonstrates a practical, model-agnostic approach to refining TTS by translating internal evaluative signals into targeted learning signals.

Abstract

Recent advances in text-to-speech (TTS) have enabled models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, evaluation methods lag behind: typical mean opinion score (MOS) estimators perform regression over entire utterances, while failures usually occur in a few problematic words. We observe that encoder-decoder ASR models (e.g., Whisper) surface word-level mismatches between speech and text via cross-attention, providing a fine-grained reward signal. Building on this, we introduce Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Without explicit reward annotations, W3AR uses attention from a pre-trained ASR model to drive finer-grained alignment and optimization of sequences predicted by a TTS model. Experiments show that W3AR improves the quality of existing TTS systems and strengthens zero-shot robustness on unseen speakers. More broadly, our results suggest a simple recipe for generative modeling: understanding models can act as evaluators, delivering informative, fine-grained feedback for optimization.

Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward

TL;DR

This work tackles the mismatch between coarse utterance-level evaluation and fine-grained speech errors in autoregressive TTS. It introduces W3AR, a framework that leverages cross-attention from a pre-trained ASR model to produce word-level rewards via Attention Purity and Alignment Monotonicity, guiding a stable group-relative policy optimization. Empirical results show significant improvements across state-of-the-art TTS models and strong generalization to unseen speakers, including out-of-domain datasets, with both objective and subjective gains. The findings advocate a broader paradigm where an expert model serves as an evaluative supervisor to provide informative, fine-grained feedback for generative systems. Overall, W3AR demonstrates a practical, model-agnostic approach to refining TTS by translating internal evaluative signals into targeted learning signals.

Abstract

Recent advances in text-to-speech (TTS) have enabled models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, evaluation methods lag behind: typical mean opinion score (MOS) estimators perform regression over entire utterances, while failures usually occur in a few problematic words. We observe that encoder-decoder ASR models (e.g., Whisper) surface word-level mismatches between speech and text via cross-attention, providing a fine-grained reward signal. Building on this, we introduce Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Without explicit reward annotations, W3AR uses attention from a pre-trained ASR model to drive finer-grained alignment and optimization of sequences predicted by a TTS model. Experiments show that W3AR improves the quality of existing TTS systems and strengthens zero-shot robustness on unseen speakers. More broadly, our results suggest a simple recipe for generative modeling: understanding models can act as evaluators, delivering informative, fine-grained feedback for optimization.

Paper Structure

This paper contains 26 sections, 9 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of our proposed W3AR framework. The left panel illustrates the core process: speech synthesized by a TTS model is fed into the ASR encoder, while the ground-truth text guides the ASR decoder. The cross-attention mechanism acts as a bridge, where the ASR decoder uses each text token as a query to probe the acoustic representations from the encoder. The right panel visually defines the failure modes captured by our two metrics: (Top) Low Purity, where attention for a single token is diffuse and scattered across many audio frames, indicating unclear articulation. (Bottom) Poor Monotonicity, where the attention peak stalls or regresses, indicating unnatural prosody or rhythm.
  • Figure 2: AB test results of W3AR (Ours) and Baseline (CosyVoice) from human listeners