An Attribute Interpolation Method in Speech Synthesis by Model Merging

Masato Murata; Koichi Miyazaki; Tomoki Koriyama

TL;DR

This paper addresses the interpolation of speech attributes (e.g., speaker identity, emotion intensity) in TTS without additional training or dedicated modules. The approach is to merge trained TTS models by averaging their parameters: the general merge of $k$ base models is $\theta_m = \frac{1}{k} \sum_{i=1}^{k} \theta_i$, and two base models $A$ and $B$ are interpolated with $\theta_\alpha = (1-\alpha)\,\theta_A + \alpha\,\theta_B$, $0 \le \alpha \le 1$. The method is evaluated on speaker generation and emotion intensity control using a Conformer-FastSpeech2 backbone and a HiFi-GAN vocoder. It shows smooth attribute interpolation while preserving linguistic content, with performance comparable to the base models in the same-gender case and tunable emotion intensity. The results suggest practical potential for rapid, training-free creation of diverse voices and expressive styles from existing base models, though cross-gender merging and some emotion styles may require richer training data. The work highlights weight averaging between pre-trained TTS models as a lightweight alternative to module-based attribute control.
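The two-model interpolation can be illustrated with a minimal sketch. It assumes each model is represented as a mapping from parameter names to flat lists of floats, a toy stand-in for a real checkpoint. The function name `merge_models`, the parameter names, and the numeric values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model's named parameters (assumption: flat float lists).
StateDict = Dict[str, List[float]]


def merge_models(theta_a: StateDict, theta_b: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * theta_a + alpha * theta_b, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("base models must share the same parameter names")

    merged: StateDict = {}
    for name, weights_a in theta_a.items():
        weights_b = theta_b[name]
        if len(weights_a) != len(weights_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two base models.
        merged[name] = [
            (1.0 - alpha) * a + alpha * b for a, b in zip(weights_a, weights_b)
        ]
    return merged


# Hypothetical toy parameters for two base models with identical architecture.
model_a: StateDict = {"encoder.weight": [0.0, 2.0], "decoder.bias": [1.0]}
model_b: StateDict = {"encoder.weight": [4.0, 6.0], "decoder.bias": [3.0]}

# Sweep alpha to move gradually from model A toward model B.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, merge_models(model_a, model_b, alpha))
```

The sketch requires both base models to share the same parameter names and shapes, since the averaging is element-wise. With a real TTS checkpoint the same loop would run over the model's tensors rather than Python lists.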

Abstract

With the development of speech synthesis, recent research has focused on challenging tasks, such as speaker generation and emotion intensity control. Attribute interpolation is a common approach to these tasks. However, most previous methods for attribute interpolation require specific modules or training methods. We propose an attribute interpolation method in speech synthesis by model merging. Model merging is a method that creates new parameters by only averaging the parameters of base models. The merged model can generate an output with an intermediate feature of the base models. This method is easily applicable without specific modules or training methods, as it uses only existing trained base models. We merged two text-to-speech models to achieve attribute interpolation and evaluated its performance on speaker generation and emotion intensity control tasks. As a result, our proposed method achieved smooth attribute interpolation while keeping the linguistic content in both tasks.
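As a quick check on why the merged model sits between the base models, the endpoints and midpoint of the two-model interpolation can be written out. This is a worked restatement of the formulas in the TL;DR, not an additional result from the paper.

$$
\theta_\alpha = (1-\alpha)\,\theta_A + \alpha\,\theta_B, \qquad
\theta_0 = \theta_A, \quad
\theta_1 = \theta_B, \quad
\theta_{1/2} = \tfrac{1}{2}\left(\theta_A + \theta_B\right) = \frac{1}{k}\sum_{i=1}^{k}\theta_i \ \text{ with } k = 2 .
$$

So $\alpha = 0$ and $\alpha = 1$ recover the base models exactly, and $\alpha = 1/2$ coincides with the plain parameter average of the two.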

Paper Structure (15 sections, 2 equations, 7 figures, 2 tables)

Introduction
Related work
Speaker Generation
Emotion Intensity Control
Model Merging Method
Attribute Interpolation Method by Model Merging
Experiments
Experimental settings
Datasets
Model Architecture
Speaker generation
Speech quality evaluation
Speaker interpolation smoothness
Emotion intensity control
Conclusions

Figures (7)

Figure 1: Overview of the attribute interpolation method by model merging.
Figure 2: Female-Female (spk emb)
Figure 3: Female-Female (model merge)
Figure 4: Male-Male (spk emb)
Figure 5: Male-Male (model merge)
...and 2 more figures
