Exploring Real-Time Music-to-Image Systems for Creative Inspiration in Music Creation

Meng Yang, Maria Teresa Llano, Jon McCormack

TL;DR

The paper presents a real-time music-to-image system that converts live MIDI input into ABC notation, which GPT-4 analyzes to infer musical features and perceived emotion. From that analysis the system generates visual prompts for Stable Diffusion, and the resulting images are displayed to the musician. A pilot user study with five musicians investigates how the generated imagery influences improvisation versus composition, and examines divergent versus convergent image-generation modes, pacing, and controllability. Findings indicate the imagery can enrich improvisational creativity by providing immersive cross-modal cues, though emotional alignment and responsiveness vary across users, highlighting a need for richer mappings and faster feedback. The work demonstrates the feasibility of cross-modal inspiration in creative practice and offers design guidelines and future directions for more adaptive, expressive, and controllable AI-assisted music workflows.

Abstract

This paper presents a study on the use of a real-time music-to-image system as a mechanism to support and inspire musicians during their creative process. The system takes MIDI messages from a keyboard as input, which are then interpreted and analysed using state-of-the-art generative AI models. Based on the perceived emotion and music structure, the system's interpretation is converted into visual imagery that is presented in real time to musicians. We conducted a user study in which musicians improvised and composed using the system. Our findings show that most musicians found the generated images to be a novel mechanism when playing, evidencing the potential of music-to-image systems to inspire and enhance their creative process.
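
The first stage of the pipeline, turning MIDI input into ABC notation for a language model to read, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names `midi_to_abc` and `notes_to_abc`, the fixed quarter-note unit length, and the key of C are all hypothetical. Each note is spelled independently with sharps only, so durations, bar lines, and bar-scoped accidentals are ignored. The calls to GPT-4 and Stable Diffusion are deliberately left out.

```python
# Pitch classes spelled with sharps only; "^" is the ABC sharp sign.
NOTE_NAMES = ["C", "^C", "D", "^D", "E", "F", "^F", "G", "^G", "A", "^A", "B"]


def midi_to_abc(note: int) -> str:
    """Convert one MIDI note number (0-127) to an ABC pitch name.

    In ABC, middle C (MIDI 60) is written "C", the octave above is "c",
    higher octaves add apostrophes, and lower octaves add commas.
    """
    if not 0 <= note <= 127:
        raise ValueError(f"MIDI note out of range: {note}")
    name = NOTE_NAMES[note % 12]
    octave = note // 12 - 5  # 0 for MIDI 60..71, the octave starting at middle C
    if octave >= 1:
        accidental, letter = name[:-1], name[-1]
        return accidental + letter.lower() + "'" * (octave - 1)
    return name + "," * (-octave)


def notes_to_abc(notes: list[int]) -> str:
    """Wrap a note sequence in a minimal ABC header (assumed: L:1/4, K:C)."""
    body = " ".join(midi_to_abc(n) for n in notes)
    return f"X:1\nL:1/4\nK:C\n{body}"


if __name__ == "__main__":
    # A C major arpeggio starting on middle C.
    print(notes_to_abc([60, 64, 67, 72]))
```

The resulting text is compact enough to embed in a language-model prompt, which is presumably one reason a symbolic format such as ABC is attractive for this kind of system.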

Paper Structure (24 sections, 1 figure, 5 tables)