Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting

Scott H. Hawley

TL;DR

The paper addresses the challenge of giving users intuitive, shape-based control over generative music. It adopts a pixel-space diffusion approach using an Hourglass Diffusion Transformer (HDiT) to perform inpainting on MIDI piano-roll images, augmented by a RePaint mechanism to increase note density within users' masks. Key contributions include achieving comparable quality to prior image-driven diffusion methods (e.g., Polyffusion) while enabling longer contexts, eliminating the need for an autoencoder, and supporting complex, arbitrarily shaped inpainting regions with explicit note-velocity embedding via color borders. The results demonstrate standard and creative inpainting tasks, with improved control over density and structure, and extensive evaluation combining objective metrics and human listening tests. The approach offers a practical, interactive pathway for machine-assisted composition through graphical prompts and pixel-space diffusion.

Abstract

Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be "repainted" with extra noise. The non-latent HDiT's linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density, yielding musical structures closely matching user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows, with no autoencoder, and can enable complex geometries for inpainting masks, increasing the options for machine-assisted composers to control the generated music.
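
The masked inpainting with "repainting" described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `toy_denoiser` function is a hypothetical stand-in for the trained HDiT model, the image is a flat list of pixel values, and the noise schedule `sigmas`, the `n_repaint` parameter, and the variance-exploding update rule are all assumptions chosen for illustration. The sketch only shows the general pattern of compositing a noised copy of the known pixels with the generated pixels at every step, and of re-adding noise so the masked region is denoised more than once.

```python
import random


def add_noise(values, sigma, rng):
    """Add independent Gaussian noise of standard deviation sigma."""
    return [v + sigma * rng.gauss(0.0, 1.0) for v in values]


def toy_denoiser(values):
    """Hypothetical stand-in for the trained model: clamps pixels to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in values]


def inpaint(original, mask, sigmas, n_repaint=1, seed=0):
    """Inpaint pixels where mask == 1; keep pixels where mask == 0.

    sigmas is a decreasing list of positive noise levels (assumed schedule).
    n_repaint > 1 repeats each step after re-adding noise ("repainting").
    """
    rng = random.Random(seed)
    x = add_noise([0.0] * len(original), sigmas[0], rng)
    for sigma, sigma_next in zip(sigmas, sigmas[1:] + [0.0]):
        for r in range(n_repaint):
            # Known region: the original image noised to the current level.
            known = add_noise(original, sigma, rng)
            x = [g if m else k for g, k, m in zip(x, known, mask)]
            # One denoising step from sigma down to sigma_next.
            x_hat = toy_denoiser(x)
            x = [h + (sigma_next / sigma) * (v - h) for v, h in zip(x, x_hat)]
            if r < n_repaint - 1:
                # Jump back up to noise level sigma before repeating the step.
                extra = (sigma ** 2 - sigma_next ** 2) ** 0.5
                x = add_noise(x, extra, rng)
    # Restore the untouched pixels exactly.
    return [g if m else o for g, o, m in zip(x, original, mask)]
```

With `n_repaint=1` this reduces to plain masked inpainting; larger values spend more denoising passes on the masked region, which is the knob the abstract associates with higher note density.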

Paper Structure (12 sections, 1 equation, 10 figures, 1 table)

This paper contains 12 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: The Motivating Idea. Top: MIDI piano roll image of a sample "graphical prompt" of rough shapes (in blue) of pitches for melody generation given accompaniment (green lines). Bottom: Sample generated output.
  • Figure 2: Sample 512x128 MIDI piano roll image. Following Polyffusion, we denote notes in green with onsets in red. We also add color-coded chord embeddings as borders along the top and bottom of the image. The right half of the image (after the dashed line) is "folded" underneath the left half to produce a square image suitable for, e.g., Hourglass Diffusion Transformer (HDiT) pipelines. After generation, the images are restored to their rectangular format. (Although this "folding" causes a reversal of direction compared with a simple copy-paste, we do this so that information need not propagate all the way across the image to maintain musical continuity. In practice we observe no issues with continuity at the fold boundary; the model quickly learns to adapt.)
  • Figure 3: Undirected generation. Here we see the model generate various assortments of melody, accompaniment, and "chord borders". Refer to the demo website for listening examples.
  • Figure 4: Melody inpainting. The top portion of the piano roll is masked out in blue (top image), and the model generates melodies (shown in the bottom 3 images) that fit the accompaniment notes and the chords (denoted by colored bars along the top and bottom).
  • Figure 5: Accompaniment inpainting, given a melody. It is noteworthy that the chords shown in the top and bottom borders are the same for each output even though the accompaniment notes differ.
  • ...and 5 more figures
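
The "folding" described in the Figure 2 caption can be sketched as follows. This is one plausible reading rather than the author's code: it assumes the image is a list of pixel rows, that the right half is mirrored in time only (pitch rows keep their order), and that the mirrored half is stacked directly below the left half. The function names `fold` and `unfold` are hypothetical.

```python
def fold(image):
    """Fold an H x W image (list of rows) into a 2H x W/2 image."""
    width = len(image[0])
    if width % 2:
        raise ValueError("width must be even")
    half = width // 2
    top = [row[:half] for row in image]
    # Mirror the right half in time so both sides of the seam sit at the
    # same (right-hand) edge of the folded image.
    bottom = [row[half:][::-1] for row in image]
    return top + bottom


def unfold(folded):
    """Invert fold(): restore the original H x W rectangular image."""
    height = len(folded) // 2
    return [folded[r] + folded[height + r][::-1] for r in range(height)]
```

Under these assumptions a 512x128 piano roll (128 rows of 512 pixels) becomes a 256x256 square, and the two time steps on either side of the dashed line end up in the same column, which is consistent with the caption's remark that information need not propagate across the whole image.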