Amuse: Human-AI Collaborative Songwriting with Multimodal Inspirations

Yewon Kim, Sung-Ju Lee, Chris Donahue

TL;DR

Amuse addresses a key shortcoming in AI-assisted songwriting by enabling multimodal inspirations to inform symbolic musical elements, specifically chord progressions. The approach couples multimodal large language model guidance with a unimodal chord prior and a rejection-sampling step to produce diverse, relevant, and musically coherent chords that integrate into Hookpad via a Chord Generator and Chord Transcriber. Through a formative study and a user study, Amuse is shown to enhance perceived agency, creativity, and workflow efficiency while maintaining output coherence, and it reveals practical patterns for multimodal-influenced compositional work. The work advances multimodal creativity support in music and suggests future directions toward real-time multimodal-contextual systems and deeper integration across musical outputs.

Abstract

Songwriting is often driven by multimodal inspirations, such as imagery, narratives, or existing music, yet songwriters remain unsupported by current music AI systems in incorporating these multimodal inputs into their creative processes. We introduce Amuse, a songwriting assistant that transforms multimodal (image, text, or audio) inputs into chord progressions that can be seamlessly incorporated into songwriters' creative processes. A key feature of Amuse is its novel method for generating coherent chords that are relevant to music keywords in the absence of datasets with paired examples of multimodal inputs and chords. Specifically, we propose a method that leverages multimodal large language models (LLMs) to convert multimodal inputs into noisy chord suggestions and uses a unimodal chord model to filter the suggestions. A user study with songwriters shows that Amuse effectively supports transforming multimodal ideas into coherent musical suggestions, enhancing users' agency and creativity throughout the songwriting process.
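The filtering step can be pictured as rejection sampling, the term used in the TL;DR: candidate progressions proposed by the LLM are kept or discarded according to how plausible a chord-only model finds them. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The toy bigram table `TOY_BIGRAM`, the floor probability `FLOOR`, the function names, and the acceptance rule `min(1, p(x) / threshold)` are all hypothetical stand-ins. A trained chord model would replace the bigram table, and the hard-coded candidate lists stand in for LLM output.

```python
import math
import random

# Hypothetical toy bigram prior over chord transitions. This is a stand-in
# for a trained unimodal chord model and is NOT taken from the paper.
TOY_BIGRAM = {
    ("C", "G"): 0.4,
    ("G", "Am"): 0.4,
    ("Am", "F"): 0.4,
    ("F", "C"): 0.4,
}
FLOOR = 0.01  # assumed probability for transitions the toy prior has not seen


def prior_log_prob(progression):
    """Log-probability of a progression under the toy bigram prior."""
    pairs = zip(progression, progression[1:])
    return sum(math.log(TOY_BIGRAM.get(pair, FLOOR)) for pair in pairs)


def acceptance_probability(progression, log_threshold):
    """Assumed acceptance rule: min(1, p(x) / exp(log_threshold))."""
    return min(1.0, math.exp(prior_log_prob(progression) - log_threshold))


def rejection_sample(candidates, log_threshold, n_keep, rng):
    """Keep up to n_keep candidates, accepting each one at random."""
    accepted = []
    for progression in candidates:
        if rng.random() < acceptance_probability(progression, log_threshold):
            accepted.append(progression)
        if len(accepted) == n_keep:
            break
    return accepted


# Stand-ins for noisy LLM suggestions (hypothetical examples).
common = ["C", "G", "Am", "F"]
unusual = ["C", "F#", "Bb", "E"]
threshold = prior_log_prob(common)
kept = rejection_sample([common, unusual], threshold, 4, random.Random(0))
```

Because acceptance is probabilistic rather than a hard cutoff, an unusual progression can still occasionally pass, which is one way such a filter could preserve diversity while favoring coherent suggestions.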

Paper Structure

This paper contains 86 sections, 2 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Amuse transforms multimodal (image, text, or audio) inspirations into reusable musical elements (chord progressions) that songwriters can seamlessly incorporate into their creative process. Amuse consists of two functionalities: Chord Generator (Left) and Chord Transcriber (Right). In the Chord Generator, users can generate music keywords from image/text inputs and generate musically coherent chord progressions based on the music keywords. These suggestions are generated by rejection-sampling the LLM-generated chord progressions using a unimodal chord model. The Chord Transcriber allows users to transcribe chords from a specified range of audio.
  • Figure 2: Screenshot of the songwriting interface used in the user study. The main workspace is the Hookpad interface (A), where users can input melodies and chords using either the keyboard or MIDI devices. Amuse (B), a Chrome extension, appears as a floating window within the Hookpad interface. Users can freely open, close, move, and resize this window. The current view of Amuse displays the Chord Generator, which generates chord progressions from user-provided images or text. Participants also used Aria (C), a tool for generating melodies and chords based on the content already written in the Hookpad interface.
  • Figure 3: Overview of Chord Generator in Amuse. (A) Initial Interface: Users can upload an image or type text, which is used to generate music keywords. Users can also directly write music keywords in the keyword editor. (B) Keyword Extraction: Upon clicking the 'Generate Keywords' button, Amuse suggests music keywords based on the multimodal inputs. User-selected keywords are automatically pasted into the keyword editor ('acoustic,' 'mellow,' and 'indie-pop' in the figure). (C) Keyword-based Chord Generation: Upon clicking the 'Generate Chords' button, Amuse suggests four chord progressions based on the keywords. Users can choose between 3-bar or 4-bar progressions (default: 4 bars). The key is automatically detected from the song configuration in Hookpad (G Maj in the figure). Clicking a chord progression automatically pastes it into Hookpad (in the figure, 'Amaj7-Em7-A7-Dmaj7' and 'Bm9-E7-C#m7-F#m7' are selected), where users can play the audio and make further edits.
  • Figure 4: Overview of Chord Transcriber in Amuse. (A) Initial Interface: Users can upload a local audio file or enter a YouTube URL. (B) Audio Inspiration Input: With an audio preview, users can select the desired segment for transcription by specifying start and end times (maximum 30 seconds). (C) Chord Transcription: Amuse detects the key (shown as Gb Min) and chords of the selected audio segment. Users can play the audio segment, and the current chord is highlighted in sync with the playback (in this figure, C#m/E). Clicking on a chord pastes it into the Hookpad interface. If the "Convert to Hookpad Key" option is checked, the chords are transposed to match the key the user is working in (G Maj in the figure).
  • Figure 5: Results from our listening study, in which listeners indicated a preference between pairs of chord progression audio clips generated by two different methods among LSTM Prior, GPT-4o, and Amuse. Each row indicates the % of times listeners preferred audio from that system compared to audio from any other system (first column, N=300) and from each system individually (other columns, N=150). A Wilcoxon signed-rank test on these paired data reveals that for Musical Coherence, Amuse shows no significant difference compared to LSTM Prior, which aligns closely with real music distributions (55.3% of LSTM Prior generations preferred over Amuse). For Keyword Relevance, Amuse is most preferred by users (58% of cases against any other samples), significantly outperforming LSTM Prior. (**$p$<.01, ***$p$<.001).
  • ...and 6 more figures
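
The "Convert to Hookpad Key" option in the Figure 4 caption transposes transcribed chords into the key of the working song. A minimal sketch of that kind of transposition follows, assuming chords are shifted by the semitone distance between the two tonics and that the mode (major or minor) is ignored. How Amuse handles mode differences and enharmonic spelling is not stated in the text above, so the function names, the regular expression, and the fixed output spelling in `PC_TO_NOTE` are hypothetical choices for illustration only.

```python
import re

# Pitch classes for note names. The output spelling in PC_TO_NOTE is a
# simplifying assumption; a real tool would pick spellings that fit the key.
NOTE_TO_PC = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
    "Bb": 10, "B": 11,
}
PC_TO_NOTE = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

# Root note, free-form quality (e.g. "m7", "maj7"), optional slash bass note.
CHORD_RE = re.compile(r"^([A-G][#b]?)([^/]*)(?:/([A-G][#b]?))?$")


def semitone_shift(source_tonic, target_tonic):
    """Upward distance in semitones from the source tonic to the target tonic."""
    return (NOTE_TO_PC[target_tonic] - NOTE_TO_PC[source_tonic]) % 12


def transpose_note(note, semitones):
    """Shift a single note name by the given number of semitones."""
    return PC_TO_NOTE[(NOTE_TO_PC[note] + semitones) % 12]


def transpose_chord(symbol, semitones):
    """Shift the root (and bass, if present) of a chord symbol; keep the quality."""
    match = CHORD_RE.match(symbol)
    if match is None:
        raise ValueError(f"Unrecognized chord symbol: {symbol!r}")
    root, quality, bass = match.groups()
    result = transpose_note(root, semitones) + quality
    if bass:
        result += "/" + transpose_note(bass, semitones)
    return result


# Tonics from the Figure 4 caption: detected key Gb, working key G.
shift = semitone_shift("Gb", "G")
converted = transpose_chord("C#m/E", shift)
```

Under these assumptions the shift from a Gb tonic to a G tonic is one semitone, so `C#m/E` becomes `Dm/F`. Whether this matches what Amuse displays depends on details, such as relative-key handling, that the caption does not specify.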