LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment
Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu
TL;DR
The paper addresses fine-grained emotional control in text-to-music generation, where text prompts are semantically ambiguous about emotion. It introduces LARA-Gen, which uses Latent Affective Representation Alignment (LARA) as training supervision: a Proxy Network aligns the backbone's hidden states with external MERT-derived affective features. Separate text and emotion prompts support continuous valence-arousal conditioning. The paper also proposes an Emotion Predictor benchmark for objective evaluation of emotional controllability, with a curated dataset that enables reproducible assessment. Empirical results show that LARA-Gen delivers continuous emotion control with better generation quality and stronger alignment to target emotions than baselines. The results suggest practical potential for applications in music therapy and interactive media, and the benchmark offers a basis for objective evaluation of emotionally controlled music generation.
Abstract
Recent advances in text-to-music models have enabled coherent music generation from text prompts, yet fine-grained emotional control remains unresolved. We introduce LARA-Gen, a framework for continuous emotion control that aligns the generation model's internal hidden states with representations from an external music understanding model through Latent Affective Representation Alignment (LARA), enabling effective training. In addition, we design an emotion control module based on a continuous valence-arousal space, which disentangles emotional attributes from textual content and bypasses the bottlenecks of text-based prompting. Furthermore, we establish a benchmark with a curated test set and a robust Emotion Predictor, facilitating objective evaluation of emotional controllability in music generation. Extensive experiments demonstrate that LARA-Gen achieves continuous, fine-grained control of emotion and significantly outperforms baselines in both emotion adherence and music quality. Generated samples are available at https://nieeim.github.io/LARA-Gen/.
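
To make the alignment idea concrete, the following is a minimal sketch of one way a representation-alignment loss could be computed. It is not the authors' implementation: the abstract does not specify the loss or the Proxy Network architecture, so the linear projection, the cosine-distance objective, and all names (`proxy_project`, `alignment_loss`, `weight`) are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def proxy_project(hidden, weight):
    """Hypothetical proxy network: a single linear map (out_dim x in_dim rows)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def alignment_loss(hidden_states, target_features, weight):
    """Mean cosine distance between projected hidden states and target features.

    hidden_states: list of backbone hidden vectors (one per frame or token).
    target_features: list of affective feature vectors from an external model
    (assumed here to be precomputed, e.g. MERT-derived).
    """
    losses = []
    for h, t in zip(hidden_states, target_features):
        projected = proxy_project(h, weight)
        losses.append(1.0 - cosine_similarity(projected, t))
    return sum(losses) / len(losses)

# Toy usage: identity projection, hidden states equal to the targets.
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss([[1.0, 2.0]], [[1.0, 2.0]], identity)
```

In a setup like this, the alignment term would typically be added to the usual generation loss with a weighting coefficient, so the backbone is nudged toward affect-aware hidden states during training; whether LARA-Gen combines the terms this way is not stated in the abstract.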
