Amuse: Human-AI Collaborative Songwriting with Multimodal Inspirations
Yewon Kim, Sung-Ju Lee, Chris Donahue
TL;DR
Amuse addresses a gap in AI-assisted songwriting: existing tools do not let multimodal inspirations (images, text, or audio) inform symbolic musical elements, specifically chord progressions. The approach combines multimodal large language model guidance with a unimodal chord prior and a rejection-sampling step to produce diverse, relevant, and musically coherent chords, which are exposed in Hookpad through a Chord Generator and a Chord Transcriber. A formative study and a user study indicate that Amuse enhances perceived agency, creativity, and workflow efficiency while keeping outputs coherent, and they surface practical patterns in how songwriters use multimodal inputs while composing. The work advances multimodal creativity support in music and points toward real-time, multimodal context-aware systems and deeper integration across musical outputs.
Abstract
Songwriting is often driven by multimodal inspirations, such as imagery, narratives, or existing music, yet current music AI systems offer songwriters little support for incorporating these inputs into their creative processes. We introduce Amuse, a songwriting assistant that transforms multimodal (image, text, or audio) inputs into chord progressions that songwriters can incorporate directly into their work. A key feature of Amuse is its novel method for generating coherent chords that are relevant to music keywords in the absence of datasets with paired examples of multimodal inputs and chords. Specifically, we propose a method that leverages multimodal large language models (LLMs) to convert multimodal inputs into noisy chord suggestions and uses a unimodal chord model to filter the suggestions. A user study with songwriters shows that Amuse effectively supports transforming multimodal ideas into coherent musical suggestions, enhancing users' agency and creativity throughout the songwriting process.
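The filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the smoothed bigram prior, the function names, and the fixed log-probability threshold are all assumptions made for illustration, and the candidate progressions stand in for the noisy suggestions a multimodal LLM might return. The sketch scores each candidate under a chord-only prior and keeps those that the prior finds plausible.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus standing in for a chord-only dataset (assumption).
CORPUS = [
    ["C", "G", "Am", "F"],
    ["C", "Am", "F", "G"],
    ["Am", "F", "C", "G"],
    ["F", "G", "C", "Am"],
]

def train_bigram_prior(corpus, alpha=0.1):
    """Count chord-to-chord transitions; alpha is an add-alpha smoothing constant."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for progression in corpus:
        vocab.update(progression)
        for prev, nxt in zip(progression, progression[1:]):
            counts[prev][nxt] += 1.0
    return counts, vocab, alpha

def avg_log_prob(progression, prior):
    """Average per-transition log-probability of a progression under the prior."""
    counts, vocab, alpha = prior
    vocab_size = len(vocab) + 1  # one extra slot for chords unseen in the corpus
    total = 0.0
    for prev, nxt in zip(progression, progression[1:]):
        row = counts.get(prev, {})
        numerator = row.get(nxt, 0.0) + alpha
        denominator = sum(row.values()) + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total / max(len(progression) - 1, 1)

def filter_suggestions(candidates, prior, threshold=-1.5):
    """Keep candidates whose average log-probability meets the threshold."""
    return [p for p in candidates if avg_log_prob(p, prior) >= threshold]

prior = train_bigram_prior(CORPUS)

# Stand-ins for noisy LLM suggestions (assumption): one common, one unusual.
candidates = [
    ["C", "G", "Am", "F"],
    ["C", "F#dim", "Bb7", "E"],
]
kept = filter_suggestions(candidates, prior)
print(kept)
```

A fixed threshold is the simplest possible acceptance rule; a probabilistic accept/reject step, as in rejection sampling, would trade some coherence for more diverse outputs.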
