Voice Impression Control in Zero-Shot TTS

Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

TL;DR

This work tackles the challenge of controlling para-/non-linguistic voice impressions in zero-shot TTS. It introduces a low-dimensional voice impression vector and a control module that disentangles impression information from speaker embeddings and reintroduces it to modulate perceived voice impressions. The system is trained in two stages for stability and includes an optional LLM-based mapping to generate impression vectors from natural-language descriptions. Objective and subjective evaluations demonstrate effective impression control with preserved speaker identity, and an LLM-based approach enables target impressions without manual vector tuning, offering practical, flexible expressivity for zero-shot TTS.

Abstract

Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).
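To make the idea of a low-dimensional impression vector concrete, the following is a minimal sketch of shifting one impression-pair intensity while leaving the others untouched. It is not the authors' implementation: the number of dimensions, the labels other than "dark-bright", the example values, and the `modulate` helper are all assumptions made for illustration.

```python
from typing import List

# Hypothetical dimension labels. Only "dark-bright" is named in the abstract;
# the other entries are placeholders, and the real vector size is not given here.
IMPRESSION_PAIRS: List[str] = ["dark-bright", "pair-2", "pair-3"]


def modulate(vector: List[float], pair: str, delta: float) -> List[float]:
    """Return a copy of `vector` with the intensity of one impression pair shifted by `delta`."""
    idx = IMPRESSION_PAIRS.index(pair)  # raises ValueError for an unknown pair
    out = list(vector)                  # copy so the input vector is left unchanged
    out[idx] += delta
    return out


# Illustrative values only, not taken from the paper.
base = [0.0, 0.5, -0.2]
brighter = modulate(base, "dark-bright", 2.0)
print(brighter)
```

In the described method such a vector would condition the TTS model through the control module; here it is shown only as a plain data structure.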

Paper Structure

This paper contains 14 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of proposed method.
  • Figure 2: Voice impression vector generation via an LLM. Prompt examples are available on our demo page.
  • Figure 3: Objective evaluation results for single-dimension modulation of the voice impression vector.
  • Figure 4: Change in scores when two dimensions (E and H) are simultaneously modulated.
  • Figure 5: Speaker similarity to target speaker's reference speech evaluated with Resemblyzer. "self" denotes generated utterances with and without modulation, with levels of $\pm1$, $\pm2$, and $\pm3$. "self/others (rec)" denotes different recorded speech of the same speaker and the recorded speech of different speakers of the same gender, respectively.