An Attribute Interpolation Method in Speech Synthesis by Model Merging
Masato Murata, Koichi Miyazaki, Tomoki Koriyama
TL;DR
This paper addresses the interpolation of speech attributes (e.g., speaker identity, emotion intensity) in TTS without additional training or dedicated modules. It proposes merging trained TTS models by averaging their parameters: the uniform merge of $k$ models is $\theta_m = \frac{1}{k} \sum_{i=1}^{k} \theta_i$, and two models A and B are interpolated as $\theta_\alpha = (1-\alpha)\,\theta_A + \alpha\,\theta_B$ with $0 \le \alpha \le 1$, so that $\alpha = 0$ recovers model A and $\alpha = 1$ recovers model B. The method is evaluated on speaker generation and emotion intensity control using a Conformer-FastSpeech2 backbone and a HiFi-GAN vocoder. It interpolates attributes smoothly while preserving linguistic content, performs comparably to the base models in the same-gender case, and makes emotion intensity tunable through $\alpha$. The results suggest practical potential for rapid, training-free creation of diverse voices and expressive styles from existing base models, although cross-gender merging and some emotion styles may require richer training data. The work positions weight averaging of pre-trained TTS models as a lightweight alternative to module-based attribute control.
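The two merging formulas above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes every model's parameters are stored as a dictionary from parameter name to a flat list of floats, and that all models share the same architecture (identical names and shapes). The names `interpolate_params` and `average_params` are hypothetical.

```python
from typing import Dict, List

# Assumed representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def _check_compatible(theta_a: Params, theta_b: Params) -> None:
    """Merging is only defined for models with identical parameter layouts."""
    if theta_a.keys() != theta_b.keys():
        raise ValueError("models must have the same parameter names")
    for name in theta_a:
        if len(theta_a[name]) != len(theta_b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")


def interpolate_params(theta_a: Params, theta_b: Params, alpha: float) -> Params:
    """Return theta_alpha = (1 - alpha) * theta_A + alpha * theta_B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    _check_compatible(theta_a, theta_b)
    return {
        name: [(1.0 - alpha) * a + alpha * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def average_params(models: List[Params]) -> Params:
    """Return theta_m = (1 / k) * sum_i theta_i for k base models."""
    if not models:
        raise ValueError("need at least one model")
    first = models[0]
    for other in models[1:]:
        _check_compatible(first, other)
    k = len(models)
    return {
        name: [sum(values) / k
               for values in zip(*(m[name] for m in models))]
        for name in first
    }


# Toy "models" with a single two-element parameter (illustrative values only).
theta_A = {"encoder.weight": [1.0, 2.0]}
theta_B = {"encoder.weight": [3.0, 6.0]}
quarter = interpolate_params(theta_A, theta_B, alpha=0.25)
midpoint = average_params([theta_A, theta_B])
```

With two base models, the uniform average coincides with interpolation at $\alpha = 0.5$. In a real TTS system the same arithmetic would be applied tensor by tensor to the saved weights of the two networks, and the merged weights would then be loaded back into a model of the same architecture.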
Abstract
With the development of speech synthesis, recent research has focused on challenging tasks such as speaker generation and emotion intensity control. Attribute interpolation is a common approach to these tasks. However, most previous methods for attribute interpolation require specific modules or training methods. We propose an attribute interpolation method for speech synthesis based on model merging. Model merging creates new parameters simply by averaging the parameters of base models, and the merged model generates output whose attributes are intermediate between those of the base models. Because it uses only existing trained base models, the method is easy to apply and needs no specific modules or training methods. We merged two text-to-speech models to achieve attribute interpolation and evaluated the approach on speaker generation and emotion intensity control tasks. In both tasks, the proposed method achieved smooth attribute interpolation while preserving the linguistic content.
