Generative Disco: Text-to-Video Generation for Music Visualization

Vivian Liu; Tao Long; Nathan Raw; Lydia Chilton

TL;DR

Generative Disco presents a novel workflow for music visualization that uses interval-based prompts to generate start/end images and interpolates between them in sync with audio. The system introduces two design patterns, holds and transitions, to structure visual narratives and maintain coherence while enabling expressive changes in color, time, subject, or style. A user study with 12 audiovisual professionals demonstrates high creativity support and low workload, highlighting the usefulness of brainstorming prompts (via GPT-4) and the balance between exploration and control. The work contributes a practical, generalizable framework for AI-assisted music visualization and points to future extensions in motion control, longer narratives, and open-source adoption. Overall, Generative Disco offers a scalable toolset for professionals to craft customized music visuals with reduced production effort and enhanced creative flexibility.

Abstract

Visuals can enhance our experience of music, owing to the way they can amplify the emotions and messages conveyed within it. However, creating music visualizations is a complex, time-consuming, and resource-intensive process. We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-video generation. The system helps users visualize music in intervals by finding prompts to describe the images that intervals start and end on and interpolating between them to the beat of the music. We introduce design patterns for improving these generated videos: transitions, which express shifts in color, time, subject, or style, and holds, which help focus the video on subjects. A study with professionals showed that transitions and holds were a highly expressive framework that enabled them to build coherent visual narratives. We conclude by discussing the generalizability of these patterns and the potential of generated video for creative professionals.
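The interval idea in the abstract (a start image, an end image, and interpolation paced by the music) can be illustrated with a small sketch. This is not the authors' implementation: the `Interval` class, `interpolation_weights`, and `lerp` are hypothetical names, the per-frame "energy" values are assumed to come from some audio analysis step that is not shown, and plain linear interpolation over small vectors stands in for whatever image or latent interpolation a real text-to-video pipeline would use.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Interval:
    """Hypothetical container for one music interval and its two prompts."""
    start_s: float
    end_s: float
    start_prompt: str
    end_prompt: str


def interpolation_weights(energy: list[float]) -> list[float]:
    """Turn per-frame audio energy into weights that rise from 0.0 to 1.0.

    The visual change between frame i-1 and frame i is proportional to the
    energy at frame i, so loud moments (for example, beats) move the image
    further toward the end prompt than quiet ones.
    """
    if len(energy) < 2:
        raise ValueError("need at least two frames")
    if any(e < 0 for e in energy):
        raise ValueError("energy values must be non-negative")
    steps = energy[1:]
    total = sum(steps)
    n = len(energy)
    if total == 0:
        # Silent interval: fall back to an evenly paced transition.
        return [i / (n - 1) for i in range(n)]
    return [0.0] + [c / total for c in accumulate(steps)]


def lerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Linear blend of two equal-length vectors (assumed stand-in for latents)."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


# Usage: five frames with energy spikes on frames 2 and 4.
interval = Interval(0.0, 2.0, "a calm ocean", "a stormy ocean")
weights = interpolation_weights([0.0, 0.0, 4.0, 0.0, 4.0])
frames = [lerp([0.0, 2.0], [4.0, 6.0], t) for t in weights]
```

In this sketch the image stays put during quiet frames and jumps on the energy spikes, which is one simple way to make visual change land "to the beat".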

Paper Structure (28 sections, 9 figures, 1 table)

Introduction
Related Work
Music Visualization
Video Creation and Editing
Generative AI in Creative Workflows
Analysis of Music Visualization
Designing with Generative Disco
The Generative Disco Interface
System Implementation
Evaluation
Experimental Design
Results
Quantitative Feedback
Creativity Support Index Metrics
NASA-TLX Metrics
...and 13 more sections

Figures (9)

Figure 1: A list of six major audio-visual alignment patterns that we found in animated music videos. The patterns that we arrived at were 1) images symbolizing the lyric, 2) cutting on the beat or around a lyric, 3) concurrent visual and audio changes, 4) visual consistency, 5) singing, and 6) choreography. Examples are illustrated with popular music videos (artists labeled inline).
Figure 2: Generative Disco system design. Users begin by interacting with the waveform to create intervals within the music (A). To find prompts that will define the start and end of intervals (C), users can brainstorm prompts using suggestions from GPT-4 (B, D) and explore text-to-image generations (E, I). Results users like can be dragged and dropped into the start and end areas (C), after which an interval can be generated. Generated intervals show in the Tracks (G) and can be stitched into a video placed in the Video Area (H).
Figure 3: Distribution of Likert scale responses on NASA-TLX, Creativity Support Index, and workflow-specific questions across all participants for the music visualization task. Full questions are in the Supplemental Material.
Figure 4: Examples of different ways participants could chunk their music into intervals. P3 chunked their music around lyrics. P4 interacted with more non-lyrical elements like slides, bass, and beats. P2 captured a drop in their metalcore song. At interval boundaries, a fork between two images indicates that participants made a cut between the two images.
Figure 5: Frame-by-frame view of music visualization by P10. Start and end prompts are displayed above the intervals. The first and last images of each track correspond to these prompts. Intervals 1 and 3 show the design pattern of a hold. Intervals 1, 4, and 5 show color and subject transitions.
...and 4 more figures
