Story2MIDI: Emotionally Aligned Music Generation from Text

Mohammad Shokri, Alexandra C. Salem, Gabriel Levine, Johanna Devaney, Sarah Ita Levitan

TL;DR

Story2MIDI tackles the challenge of generating emotion-aligned music from narrative text by constructing a cross-modal dataset that maps textual emotions to music via valence–arousal quadrants. The approach combines a RoBERTa-based encoder trained with supervised contrastive learning and a decoder pre-trained on a large symbolic-music corpus (GiantMIDI-Piano) and then fine-tuned on EMOPIA, with REMI tokenization for MIDI. The work introduces the Story2MIDI dataset, reports both subjective listening evaluations and objective arousal–valence (AV) metrics, and demonstrates partial success in capturing arousal and some valence cues, while highlighting biases and areas for improvement. Overall, the paper provides a feasible framework for narrative-to-music emotion alignment and a foundation for richer cross-modal music generation research. The NSF-supported work points to a promising direction for emotionally coherent AI-assisted storytelling.

Abstract

In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model's ability to capture intended emotional cues.
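
The quadrant-based pairing of text and music described above can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the 0.5 midpoint (which assumes valence and arousal scores scaled to [0, 1]), the descriptive labels such as `HVHA`, the helper names `quadrant` and `pair_by_quadrant`, the all-pairs matching strategy, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict


def quadrant(valence: float, arousal: float, midpoint: float = 0.5) -> str:
    """Return a descriptive quadrant label, e.g. 'HVHA' = high valence, high arousal.

    Assumes both scores lie in [0, 1]; the midpoint is a hypothetical threshold.
    """
    v = "HV" if valence >= midpoint else "LV"
    a = "HA" if arousal >= midpoint else "LA"
    return v + a


def pair_by_quadrant(texts, pieces):
    """Pair every text with every music piece that shares its quadrant.

    texts:  list of (text_id, valence, arousal)
    pieces: list of (piece_id, quadrant_label)
    """
    by_quadrant = defaultdict(list)
    for piece_id, label in pieces:
        by_quadrant[label].append(piece_id)

    pairs = []
    for text_id, valence, arousal in texts:
        label = quadrant(valence, arousal)
        for piece_id in by_quadrant.get(label, []):
            pairs.append((text_id, piece_id, label))
    return pairs


# Toy inputs with made-up scores, purely for illustration.
texts = [("story_a", 0.9, 0.8), ("story_b", 0.2, 0.3)]
pieces = [("piece_1", "HVHA"), ("piece_2", "LVLA"), ("piece_3", "HVHA")]
for pair in pair_by_quadrant(texts, pieces):
    print(pair)
```

Matching at the quadrant level keeps the pairing coarse: a text and a piece only need to agree on the sign of valence and arousal, which is what lets independently labeled text and music datasets be merged without any jointly annotated examples.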

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The 27 emotion classes of the GoEmotions dataset placed on the AV plane. Arousal and valence scores were looked up in the NRC VAD lexicon.
  • Figure 2: Overview of Story2MIDI. The encoder is a RoBERTa encoder that is contrastively fine-tuned to better distinguish emotions. The decoder is a Transformer decoder with three decoder layers and four self-attention heads per layer.
  • Figure 3: Valence-arousal metrics comparing our generated samples after 300 epochs of fine-tuning with the EMOPIA dataset. The plots show differences in average note length, note velocity, and major key ratio across the four emotion quadrants (Q1–Q4). In the Major Key Ratio plot, generated samples exhibit a statistically significant difference between positive-valence (Q1–Q2) and negative-valence (Q3–Q4) quadrants, reflecting valence sensitivity in our model. Similarly, the Average Note Length plot shows a significant difference between high-arousal and low-arousal quadrants, indicating the model’s ability to capture arousal variation through musical structure.
  • Figure 4: t-SNE visualization of story embeddings before and after contrastive training. Contrastive fine-tuning leads to more distinct quadrant clusters.
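
The three quantities compared in Figure 3 (average note length, average note velocity, and major key ratio) can be computed along the following lines. This is a minimal sketch under stated assumptions: the `Note` record, the function names, and the key strings such as `"C major"` are hypothetical, and key labels are taken as given rather than estimated from the music. The paper's own metric definitions may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Note:
    start: float   # onset time in seconds
    end: float     # release time in seconds
    velocity: int  # MIDI velocity, 0-127


def average_note_length(notes):
    """Mean duration of the notes in one piece, in seconds."""
    return mean(n.end - n.start for n in notes)


def average_velocity(notes):
    """Mean MIDI velocity of the notes in one piece."""
    return mean(n.velocity for n in notes)


def major_key_ratio(keys):
    """Fraction of pieces labeled with a major key.

    keys: one label per piece, e.g. 'C major' or 'A minor' (hypothetical format).
    """
    return sum(k.endswith("major") for k in keys) / len(keys)


# Toy piece and toy key labels, purely for illustration.
piece = [Note(0.0, 0.5, 60), Note(0.5, 2.0, 80)]
print(average_note_length(piece), average_velocity(piece))
print(major_key_ratio(["C major", "A minor", "G major", "E minor"]))
```

Computing these per quadrant, once for generated samples and once for the reference dataset, gives the kind of side-by-side comparison the figure describes; testing whether the differences are significant would need a separate statistical test not shown here.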