EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu

TL;DR

EmoSteer-TTS tackles the limitation of coarse emotion control in text-to-speech by introducing a training-free method that steers internal activations in flow-matching TTS models. The approach extracts emotion-relevant activation differences across DiT layers, constructs per-layer steering vectors from top-$k$ emotionally salient tokens, and applies inference-time perturbations, enabling continuous emotion conversion, interpolation, erasure, and composite control without retraining. Evaluations across multiple pretrained backbones and a curated six-emotion dataset show superior fine-grained control and robustness, including zero-shot generalization to out-of-domain data, with competitive objective metrics and strong subjective ratings. This work provides both practical EC-TTS capabilities and deeper insights into emotion steering dynamics in diffusion-transformer–based TTS, laying groundwork for zero-shot, interpretable expressive synthesis in real-world applications.
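
As a rough illustration of the steering-vector construction summarized above, the following is a minimal sketch for one layer, not the authors' implementation. It assumes activations have already been collected as plain lists of token vectors (for example with forward hooks on a DiT layer). The function names and the saliency rule (distance of an emotional token's activation from the mean neutral activation) are hypothetical choices made for this example; the paper's actual token-selection criterion may differ.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def steering_vector(neutral_acts, emotional_acts, k):
    """Build a unit-length steering vector for a single layer.

    neutral_acts, emotional_acts: lists of token activation vectors
    (hypothetical inputs gathered from neutral and emotional speech).
    Assumption: saliency = L2 distance from the mean neutral activation.
    """
    neutral_mean = mean_vector(neutral_acts)
    # Difference of each emotional token from the neutral baseline.
    diffs = [[e - n for e, n in zip(tok, neutral_mean)] for tok in emotional_acts]
    # Keep only the k tokens whose activations moved the most.
    top_k = sorted(diffs, key=l2_norm, reverse=True)[:k]
    v = mean_vector(top_k)
    norm = l2_norm(v)
    return [x / norm for x in v] if norm > 0 else v
```

Repeating this per layer would give one vector per DiT layer. Normalizing to unit length is a design choice of this sketch so that a separate scalar can set the steering strength at inference time.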

Abstract

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
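
The three control modes named in the abstract (conversion, interpolation, erasure) can be pictured as simple vector operations on a single activation. The sketch below is an illustration under assumptions, not the paper's exact formulation: `convert`, `interpolate`, and `erase` are hypothetical names, `alpha`, `t`, and `beta` are hypothetical strength parameters, and erasure is modeled here as removing the activation's projection onto the steering direction.

```python
import math


def convert(h, v, alpha):
    """Conversion: shift activation h along steering vector v by strength alpha."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]


def interpolate(v_a, v_b, t):
    """Interpolation: blend two emotion steering vectors, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(v_a, v_b)]


def erase(h, v, beta=1.0):
    """Erasure (assumed form): subtract h's component along direction v.

    beta = 1.0 removes the full projection; smaller values remove less.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(h)
    unit = [x / norm for x in v]
    proj = sum(hi * ui for hi, ui in zip(h, unit))
    return [hi - beta * proj * ui for hi, ui in zip(h, unit)]
```

Because `alpha`, `t`, and `beta` are continuous scalars, varying them is what would make the control continuous rather than tied to discrete emotion labels. In a real model these functions would be applied to the activations of the selected layers during inference.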

Paper Structure

This paper contains 34 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Motivations of our work. (a) Existing paradigm for speech emotion control. (b) EmoSteer-TTS offers training-free, fine-grained continuous emotion control with improved interpretability.
  • Figure 2: Adding a sadness steering vector to the activations in five DiT layers (1, 6, 11, 16, 21) of F5-TTS, conditioned on neutral speech, substantially increases the predicted sadness probability.
  • Figure 3: Overview of EmoSteer-TTS. Steering vectors and steering weights are derived from pairs of neutral and emotional reference speech. During inference, these vectors are used to modulate the activations in a TTS model, guiding it to synthesize speech that reflects the desired emotion.
  • Figure 4: Emotion steering results on MSP-Podcast and ESD. ● emotion2vec, ▲ SenseVoice.
  • Figure 5: Visualization of F0 contours. (a) An example showing how the F0 contour varies with steering intensity; (b) The speech tone (F0 contour) becomes calmer after emotion erasure.
  • ...and 5 more figures