Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models

Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman

Abstract

Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
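As a rough illustration of the idea summarized above, the following sketch shows one plausible reading of success-filtered activation aggregation followed by inference-time steering. It is not the authors' implementation: the function names, the mean-activation score, the top-fraction selection rule, and the additive steering form are all assumptions made for illustration. The parameters `rate`, `max_instances`, and `alpha` loosely mirror the selection rate $r$, success instance size $c$, and intervention strength $\alpha$ named in the figure captions.

```python
import numpy as np


def select_esn_mask(activations, success, rate=0.005, max_instances=50):
    """Hypothetical selector: pick the top `rate` fraction of neurons by
    mean activation over successful conversions only.

    activations: float array of shape (n_instances, n_neurons)
    success:     bool array of shape (n_instances,), True where the
                 conversion realized the target emotion AND kept the content
    """
    # Success filtering: discard failed conversions, then cap the count.
    kept = activations[success][:max_instances]
    if kept.shape[0] == 0:
        raise ValueError("no successful instances to aggregate")
    # Aggregation: mean activation per neuron (an assumed scoring rule).
    scores = kept.mean(axis=0)
    k = max(1, int(round(rate * scores.size)))
    top = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top] = True
    return mask


def steer(hidden, mask, direction, alpha=1.0):
    """Hypothetical intervention: add alpha * direction to masked neurons only.

    hidden:    float array of shape (..., n_neurons)
    direction: float array of shape (n_neurons,)
    """
    steered = hidden.copy()
    steered[..., mask] += alpha * direction[mask]
    return steered
```

In this sketch the mask is computed once offline and reused at inference time, so no model weights are updated, which is the sense in which such an intervention is training-free.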

Paper Structure (29 sections, 2 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Emotional voice conversion (EVC) with LALMs is inherently multi-objective: successful conversion requires both target-emotion realization and linguistic content preservation.
  • Figure 2: Overview of our four-stage pipeline for identifying and manipulating emotion-sensitive neurons in LALMs for EVC.
  • Figure 3: Comparison of ESN identification methods under activation steering, reported relative to the unintervened baseline in Table \ref{tab:baseline}.
  • Figure 4: Sensitivity to ESN selection rate $r$ (Qwen2.5-Omni-7B, CAS, $c{=}50$, steering, $\alpha{=}1.0$).
  • Figure 5: Sensitivity to success instance size $c$ (Qwen2.5-Omni-7B, CAS, $r{=}0.5\%$, steering, $\alpha{=}1.0$).
  • ...and 5 more figures