LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad

Siting Xu, Yolo Yunlong Tang, Feng Zheng

TL;DR

The paper addresses automating music visualization design for Launchpad by translating audio features into Launchpad lighting commands using a GPT-based model. It trains on prompt-completion pairs derived from 16 Launchpad-playing videos, mapping MFCC audio features to per-frame RGB-X button states via NanoGPT, and validates against random baselines with Fréchet Video Distance. Results show that LaunchpadGPT yields more human-like, coherent visualizations and reveals a correlation between audio amplitude and lighting intensity, suggesting practical utility for rapid visualization design in performances and installations. The approach enables broader applications in music visualization, such as game charts and lighting designs, while reducing design time and providing a scalable framework for audio-visual synchronization.

Abstract

Launchpad is a musical instrument that allows users to create and perform music by pressing illuminated buttons. To assist and inspire the design of Launchpad light effects, and to give beginners a more accessible way to create music visualizations with this instrument, we propose the LaunchpadGPT model, which generates music visualization designs on Launchpad automatically. Built on a language model with strong generative ability, LaunchpadGPT takes a piece of music as audio input and outputs the lighting effects of Launchpad playing in the form of a video (a Launchpad-playing video). We collect Launchpad-playing videos and process them to obtain music and the corresponding Launchpad-playing video frames as prompt-completion pairs, which are used to train the language model. Experimental results show that the proposed method creates better music visualizations than random generation methods and holds potential for a broader range of music visualization applications. Our code is available at https://github.com/yunlong10/LaunchpadGPT/.

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: An illustration of a Novation Launchpad with 8×8 illuminated buttons.
  • Figure 2: The framework of LaunchpadGPT. In the training phase, aligned music and frame information is extracted from the Launchpad-playing videos. MFCC features are extracted from the music, while color-coordinate tuples (R, G, B, X) are obtained from the frames. The MFCC features serve as the text prompt input for the language model, while the RGB-X tuples are tokenized as text for the "completion", training the language model in the teacher-forcing paradigm. In the inference phase, the input music's features are extracted and converted to text tokens, which serve as the prompt for the trained language model. The model generates a series of (R, G, B, X) tuples as the "completion" output, which guide the frame generation.
  • Figure 3: An example of a prompt-completion pair. The textual MFCC feature values follow "prompt:", and the textual RGB-X tuples follow "completion:". The first RGB-X tuple (245, 5, 169, 1) denotes that the second button on the Launchpad keyboard (the first button has index 0) is purple.
  • Figure 4: The outputs generated by LaunchpadGPT, Random-RGB, and Random-RGBX, respectively.
  • Figure 5: Visualization of results, aligning audio frames with the corresponding generated video frames.
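
The prompt-completion format described for Figure 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`format_pair`, `decode_completion`, `render_frame`), the exact whitespace of the text format, the two-decimal rounding of MFCC values, and the row-major mapping from button index X to the 8×8 grid are all assumptions made for this example.

```python
import re
from typing import List, Tuple

RGBX = Tuple[int, int, int, int]  # (R, G, B, X); X is the button index
GRID_SIZE = 8                     # 8x8 Launchpad grid, as in Figure 1


def format_pair(mfcc: List[float], tuples: List[RGBX]) -> str:
    """Serialize one audio frame and its lit buttons as text.

    The layout ("prompt: ... / completion: ...") follows the Figure 3
    caption; spacing and rounding are assumptions.
    """
    prompt = " ".join(f"{v:.2f}" for v in mfcc)
    completion = " ".join(f"({r}, {g}, {b}, {x})" for r, g, b, x in tuples)
    return f"prompt: {prompt}\ncompletion: {completion}"


def decode_completion(text: str) -> List[RGBX]:
    """Parse RGB-X tuples from generated text, dropping invalid ones."""
    body = text.split("completion:", 1)[-1]
    pattern = r"\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
    result = []
    for match in re.findall(pattern, body):
        r, g, b, x = (int(m) for m in match)
        # Keep only valid colours and button indices; a language model
        # may emit out-of-range values.
        if max(r, g, b) <= 255 and x < GRID_SIZE * GRID_SIZE:
            result.append((r, g, b, x))
    return result


def render_frame(tuples: List[RGBX]) -> List[List[Tuple[int, int, int]]]:
    """Place each tuple on an 8x8 grid; unlisted buttons stay off (black).

    Assumes row-major indexing: row = X // 8, column = X % 8.
    """
    frame = [[(0, 0, 0)] * GRID_SIZE for _ in range(GRID_SIZE)]
    for r, g, b, x in tuples:
        frame[x // GRID_SIZE][x % GRID_SIZE] = (r, g, b)
    return frame
```

With the tuple from Figure 3, `render_frame(decode_completion("completion: (245, 5, 169, 1)"))` sets the second button of the first row to purple under the row-major assumption and leaves every other button dark.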