Activation Steering for Accent-Neutralized Zero-Shot Text-To-Speech

Mu Yang, John H. L. Hansen

TL;DR

This study introduces a novel, post-hoc, and training-free approach to neutralize accent while preserving the speaker's original timbre, utilizing inference-time activation steering.

Abstract

Zero-shot Text-to-Speech (TTS) models can generate speech that captures both the voice timbre and accent of a reference speaker. However, disentangling these attributes remains challenging, as the output often inherits both the accent and timbre from the reference. In this study, we introduce a novel, post-hoc, and training-free approach to neutralize accent while preserving the speaker's original timbre, utilizing inference-time activation steering. We first extract layer-specific "steering vectors" offline, which are derived from the internal activation differences within the TTS model between accented and native speech. During inference, the steering vectors are applied to guide the model to produce accent-neutralized, timbre-preserving speech. Empirical results demonstrate that the proposed steering vectors effectively mitigate the output accent and exhibit strong generalizability to unseen accented speakers, offering a practical solution for accent-free voice cloning.
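
The two stages described in the abstract (offline extraction, then inference-time application) can be illustrated with a minimal, library-free sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, the steering vector is taken as the difference of mean activations at one layer (native minus accented), and it is applied by scaled addition to that layer's output. The actual pooling, sign convention, and injection point in the paper may differ.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty batch of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_steering_vector(
    accented: Sequence[Sequence[float]],
    native: Sequence[Sequence[float]],
) -> Vector:
    """Offline step: difference of mean activations at one layer.

    Assumed convention: native minus accented, so that adding the
    vector moves activations toward the native-speech mean.
    """
    mu_accented = mean_vector(accented)
    mu_native = mean_vector(native)
    return [n - a for n, a in zip(mu_native, mu_accented)]


def steer(hidden: Sequence[float], vector: Sequence[float], strength: float = 1.0) -> Vector:
    """Inference step: add the scaled steering vector to one activation."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def forward_with_steering(
    layers: Sequence[Callable[[Vector], Vector]],
    x: Vector,
    steer_layer: int,
    vector: Sequence[float],
    strength: float = 1.0,
) -> Vector:
    """Run toy layers in order, steering only the output of one layer.

    `steer_layer` is zero-indexed; all other layers are left unchanged,
    mirroring the single-layer setting.
    """
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == steer_layer:
            x = steer(x, vector, strength)
    return x


# Toy activations standing in for one layer's hidden states (hypothetical data).
accented_acts = [[1.0, 2.0], [3.0, 4.0]]
native_acts = [[0.0, 0.0], [2.0, 2.0]]
v = extract_steering_vector(accented_acts, native_acts)
```

In this toy setup, steering the accented mean activation with strength 1.0 lands exactly on the native mean, and smaller strengths interpolate between the two. In a real TTS model the same idea would typically be wired in through a forward hook on the chosen layer rather than an explicit loop.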

Paper Structure (13 sections, 2 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: The proposed activation steering framework for accent-neutralized zero-shot TTS. (a) Steering vectors are extracted offline from the activation differences between accented and neutral speech, and (b) applied during inference to guide the model towards accent-neutralized output while preserving timbre. In this paper, we experiment only with single-layer steering, i.e., one layer is steered while all other layers are left unchanged. The figure shows the more general framework, in which multiple layers can be steered simultaneously.
  • Figure 2: Layerwise single-layer steering analyses on L2-ARCTIC. Layers are zero-indexed (layer 1 means the 2nd layer). Solid and dashed lines represent the results of steering with different models and steering strengths. Dotted lines represent the unsteered baseline.