Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models

Xiutian Zhao; Ismail Rasim Ulgen; Philipp Koehn; Björn Schuller; Berrak Sisman

Abstract

Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
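The steps the abstract names (success filtering, activation aggregation, selection of a compact neuron set, inference-time steering) can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation. The function names `select_esn` and `steer`, the mean-contrast ranking score, and the additive steering rule are all assumptions made for illustration. Only the selection rate r and the steering strength α correspond to parameters named in the figure captions.

```python
def select_esn(acts, emotions, success, target, r=0.005):
    """Pick a small set of neuron indices for `target` (hypothetical selector).

    acts:     list of per-attempt activation vectors (all the same length)
    emotions: target-emotion label of each attempt
    success:  True where the attempt realized the emotion AND preserved content
    r:        fraction of neurons to keep (selection rate)
    """
    n_neurons = len(acts[0])
    # Success filtering: only attempts that passed both checks are aggregated.
    tgt = [a for a, e, s in zip(acts, emotions, success) if s and e == target]
    oth = [a for a, e, s in zip(acts, emotions, success) if s and e != target]
    if not tgt or not oth:
        raise ValueError("need successes for the target and for another emotion")

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    # Assumed ranking score: mean activation on the target emotion
    # minus mean activation on every other emotion.
    score = [mean(tgt, j) - mean(oth, j) for j in range(n_neurons)]
    k = max(1, round(r * n_neurons))
    ranked = sorted(range(n_neurons), key=score.__getitem__, reverse=True)
    return set(ranked[:k])


def steer(hidden, esn, direction, alpha=1.0):
    """Add alpha * direction to the selected neurons only (assumed steering rule)."""
    return [h + alpha * direction[j] if j in esn else h
            for j, h in enumerate(hidden)]
```

In this sketch a failed attempt never influences the ranking, however large its activations, which is the point of filtering on success before aggregating.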

Paper Structure (29 sections, 2 equations, 10 figures, 6 tables)

Introduction
Related Work
Methodology
Activation Sampling
Success Filtering and Set Construction
Activation Aggregation and Emotion-Sensitive Neuron Identification
Activation Statistics on Filtered Successes
Neuron Ranking Methods
Inference-Time Emotion Control via Neuron-Level Interventions
Evaluation Metrics and Protocol
Experiment Setup
Models
Dataset
Task Implementation
Results
...and 14 more sections

Figures (10)

Figure 1: EVC with LALMs is inherently multi-objective: successful conversion requires both target-emotion realization and linguistic content preservation.
Figure 2: The overview of our four-stage pipeline for identifying and manipulating emotion-sensitive neurons in LALMs for EVC.
Figure 3: Comparison of ESN identification methods under activation steering, reported relative to the unintervened baseline in Table \ref{tab:baseline}.
Figure 4: Sensitivity to ESN selection rate $r$ (Qwen2.5-Omni-7B, CAS, $c{=}50$, steering, $\alpha{=}1.0$).
Figure 5: Sensitivity to success instance size $c$ (Qwen2.5-Omni-7B, CAS, $r{=}0.5\%$, steering, $\alpha{=}1.0$).
...and 5 more figures
