ScripTONES: Sentiment-Conditioned Music Generation for Movie Scripts
Vishruth Veerendranath, Vibha Masti, Utkarsh Gupta, Hrishit Chaudhuri, Gowri Srinivasa
TL;DR
ScripTONES addresses the challenge of generating film scores for scripts with a two-stage approach: it first maps scene sentiment to the Valence-Arousal ($\mathit{VA}$) space using the NRC VAD lexicon, and then conditionally generates piano MIDI music with either a Transformer-based EMOPIA-CWT model or a MusicVAE model augmented by attribute-vector arithmetic. It introduces continuous sentiment conditioning via latent vectors and discrete conditioning via quadrant labels, plus continuous and discrete latent-space regularization to improve alignment with sentiment. The authors evaluate the approach through a qualitative user study, which shows that EMOPIA-CWT often yields more nuanced music while MusicVAE enables fine-grained sentiment manipulation; they also demonstrate attribute-vector arithmetic and interpolation for sentiment that evolves within a scene. Overall, the work provides a practical, controllable pipeline for low-cost, sentiment-aware music generation from movie scripts, with implications for indie and small-team productions.
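As a rough illustration of the attribute-vector arithmetic and interpolation mentioned above, the following minimal sketch operates on plain Python lists standing in for latent vectors. The function names (`attribute_vector`, `shift_latent`, `interpolate`), the mean-difference definition of the attribute vector, and the use of linear interpolation are assumptions made for illustration; they are not taken from the authors' implementation, and no real MusicVAE encoder or decoder is involved.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_vector(z_with, z_without):
    """Assumed definition: mean latent of examples that have the attribute
    (e.g. high arousal) minus the mean latent of examples that lack it."""
    m_with = mean_vector(z_with)
    m_without = mean_vector(z_without)
    return [a - b for a, b in zip(m_with, m_without)]


def shift_latent(z, attr, strength=1.0):
    """Move a latent code along the attribute direction by `strength`."""
    return [zi + strength * ai for zi, ai in zip(z, attr)]


def interpolate(z_start, z_end, steps):
    """Linear interpolation between two latent codes (an assumption; other
    schemes such as spherical interpolation are also possible)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

In a setup like this, each shifted or interpolated vector would be passed to the model's decoder to obtain music; decoding the points along `interpolate` in order is one way sentiment could change gradually over a scene.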
Abstract
Film scores are considered an essential part of the cinematic experience, but composing a film score is often expensive and infeasible for small-scale creators. Automating film score composition would provide useful starting points for music in small projects. In this paper, we propose a two-stage pipeline for generating music from a movie script. The first stage is Sentiment Analysis, in which the sentiment of a scene from the film script is encoded into the continuous valence-arousal space. The second stage is Conditional Music Generation, which takes the valence-arousal vector as input and conditionally generates piano MIDI music to match the sentiment. We study the efficacy of various music generation architectures through a qualitative user survey and propose methods to improve sentiment conditioning in VAE architectures.
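The first stage can be sketched as follows, under stated assumptions. The tiny `TOY_LEXICON` below is a hypothetical stand-in for a valence-arousal lexicon with scores in [0, 1]; its values are invented for illustration. Averaging word scores over the scene, falling back to a neutral point when no word matches, and splitting the space at 0.5 are likewise assumptions rather than the authors' documented procedure. The quadrant names follow the common four-quadrant convention (Q1 high valence and high arousal, Q2 low valence and high arousal, Q3 low valence and low arousal, Q4 high valence and low arousal).

```python
import re

# Hypothetical word -> (valence, arousal) scores in [0, 1].
# Invented values for illustration only, not real lexicon entries.
TOY_LEXICON = {
    "happy": (0.9, 0.7),
    "calm": (0.8, 0.2),
    "angry": (0.1, 0.9),
    "sad": (0.2, 0.3),
}


def scene_to_va(text, lexicon=TOY_LEXICON):
    """Average the valence and arousal of all lexicon words in the scene.

    Returns the neutral point (0.5, 0.5) if no word is found (an assumption).
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return (0.5, 0.5)
    n = len(scores)
    valence = sum(v for v, _ in scores) / n
    arousal = sum(a for _, a in scores) / n
    return (valence, arousal)


def va_to_quadrant(valence, arousal, midpoint=0.5):
    """Map a continuous VA point to a discrete quadrant label."""
    high_v = valence >= midpoint
    high_a = arousal >= midpoint
    if high_v and high_a:
        return "Q1"
    if not high_v and high_a:
        return "Q2"
    if not high_v and not high_a:
        return "Q3"
    return "Q4"
```

The continuous pair returned by `scene_to_va` corresponds to the valence-arousal vector that the second stage conditions on, while `va_to_quadrant` gives the kind of discrete label a quadrant-conditioned generator would need.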
