LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment

Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu

TL;DR

The paper addresses fine-grained emotional control in text-to-music generation, where text prompts express emotion only ambiguously. It introduces LARA-Gen, which supervises training with Latent Affective Representation Alignment (LARA): a Proxy Network aligns the backbone's hidden states with affective features from an external MERT encoder. Emotion is conditioned separately from text, as a continuous valence-arousal value supplied alongside the text prompt. The paper also proposes a benchmark for objective evaluation of emotional controllability, consisting of a curated test set and an Emotion Predictor. In the reported experiments, LARA-Gen achieves continuous emotion control and outperforms baselines in both emotion adherence and music quality. Potential applications include music therapy and interactive media.

Abstract

Recent advances in text-to-music models have enabled coherent music generation from text prompts, yet fine-grained emotional control remains unresolved. We introduce LARA-Gen, a framework for continuous emotion control that aligns the internal hidden states with an external music understanding model through Latent Affective Representation Alignment (LARA), enabling effective training. In addition, we design an emotion control module based on a continuous valence-arousal space, disentangling emotional attributes from textual content and bypassing the bottlenecks of text-based prompting. Furthermore, we establish a benchmark with a curated test set and a robust Emotion Predictor, facilitating objective evaluation of emotional controllability in music generation. Extensive experiments demonstrate that LARA-Gen achieves continuous, fine-grained control of emotion and significantly outperforms baselines in both emotion adherence and music quality. Generated samples are available at https://nieeim.github.io/LARA-Gen/.
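As a rough illustration of the alignment idea in the abstract, the sketch below scores how well projected hidden states match features from an external model. It rests on assumptions the text does not confirm: the proxy is a single linear map, the loss is one minus the mean per-frame cosine similarity, and the hidden states and target features already have the same number of frames. The names `linear_proxy`, `cosine`, and `alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def linear_proxy(hidden, weight):
    """Hypothetical proxy: project one hidden-state vector with a linear map."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors, with eps guarding against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_states, targets, weight):
    """Assumed loss: 1 - mean per-frame cosine similarity (not the paper's stated form)."""
    if not hidden_states or len(hidden_states) != len(targets):
        raise ValueError("need the same non-zero number of frames on both sides")
    sims = [cosine(linear_proxy(h, weight), m) for h, m in zip(hidden_states, targets)]
    return 1.0 - sum(sims) / len(sims)


# Toy usage: two frames, identity projection, targets equal to the hidden states.
identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 2.0]]
loss_when_aligned = alignment_loss(frames, frames, identity)
```

Under these assumptions the loss approaches 0 when every projected frame points in the same direction as its target and approaches 2 when they point in opposite directions. In training, a term of this kind would be added to the generation objective, with the external encoder's features treated as fixed targets.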

Paper Structure

This paper contains 7 sections, 7 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: (a) LARA-Gen framework. A Proxy Network $\mathcal{P}_{\theta}$ aligns the internal hidden states $\mathbf{H}$ of the backbone model with target features $\bar{\mathbf{M}}$ from a frozen MERT encoder. (b) The architecture of the Emotion Predictor. It applies a sliding window over MERT features and an Emotion Regression Head $\mathcal{R}_{\phi}$ to produce a final valence-arousal prediction for a given piece of music.
  • Figure 2: Emotion values predicted by the Emotion Predictor vs. ground-truth emotion values on the DEAM test set; $\sigma$ denotes the standard deviation of the error. (1) The notable error on ground-truth (GT) music highlights the difficulty of out-of-domain prediction. (2) Arousal prediction is consistently more reliable than valence prediction. (3) The LARA-Gen system outperforms the Emotion Text Prompting baseline in both error and correlation.
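The sliding-window prediction described in Figure 1(b) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each window is mean-pooled, the regression head is a single linear layer, the per-window outputs are averaged into one valence-arousal pair, and the window and hop sizes are arbitrary. All function names and parameters here are hypothetical.

```python
def sliding_windows(frames, window, hop):
    """Yield windows of `window` frames, advancing by `hop`; short inputs give one window."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    if len(frames) <= window:
        yield frames
        return
    # Trailing frames that do not fill a whole window are dropped in this sketch.
    for start in range(0, len(frames) - window + 1, hop):
        yield frames[start:start + window]


def mean_pool(frames):
    """Average a window of feature vectors into a single vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def linear_head(feature, weight, bias):
    """Hypothetical regression head: one linear layer with two outputs."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def predict_valence_arousal(frames, weight, bias, window=4, hop=2):
    """Assumed aggregation: average the per-window (valence, arousal) predictions."""
    if not frames:
        raise ValueError("need at least one feature frame")
    preds = [linear_head(mean_pool(w), weight, bias)
             for w in sliding_windows(frames, window, hop)]
    valence = sum(p[0] for p in preds) / len(preds)
    arousal = sum(p[1] for p in preds) / len(preds)
    return valence, arousal


# Toy usage: six 2-dimensional "feature" frames standing in for encoder output.
toy_frames = [[float(i), 1.0] for i in range(6)]
toy_weight = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.5]
valence, arousal = predict_valence_arousal(toy_frames, toy_weight, toy_bias)
```

Averaging over windows is one simple way to turn clip-level regressions into a single track-level estimate; other aggregations, such as a median or a learned pooling, would fit the same description in the caption.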