SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data
Liqian Zhang, Magdalena Fuentes
TL;DR
SONIQUE addresses video-to-music generation without relying on paired audio-visual data by uniting Video-LLaMA-based video understanding, LLM-driven tag generation, and a CLAP-conditioned diffusion U-Net trained on royalty-free music. The system converts video semantics into concise musical prompts via tagging, enabling user control over instruments, genres, tempo, and melodies. Trained on about 2644 hours of music with LP-MusicCaps tagging and evaluated with both objective metrics (FD_openl3, KL_passt, CLAP) and a human study, SONIQUE achieves competitive results and demonstrates clear semantic alignment, though timing precision for longer clips remains challenging. The work offers an open-source, scalable path for customizable, royalty-friendly video background music generation using unpaired data.
Abstract
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.
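The user-control step described above can be illustrated with a minimal sketch. It assumes the video-derived musical tags and the user's preferences are both available as simple tag sets, and that they are flattened into one text prompt for the conditional diffusion model. The names `MusicTags`, `merge_tags`, and `build_prompt` are hypothetical, and the "user value wins per category" policy is an assumption made for illustration, not SONIQUE's documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MusicTags:
    """Hypothetical container for the tag categories named in the abstract."""
    instruments: list[str] = field(default_factory=list)
    genres: list[str] = field(default_factory=list)
    tempo: str = ""
    melody: str = ""


def merge_tags(video_tags: MusicTags, user_tags: MusicTags) -> MusicTags:
    """Assumed policy: a user-supplied category overrides the video-derived one."""
    return MusicTags(
        instruments=user_tags.instruments or video_tags.instruments,
        genres=user_tags.genres or video_tags.genres,
        tempo=user_tags.tempo or video_tags.tempo,
        melody=user_tags.melody or video_tags.melody,
    )


def build_prompt(tags: MusicTags) -> str:
    """Flatten the tags into a comma-separated text prompt, skipping empty fields."""
    parts = [*tags.instruments, *tags.genres, tags.tempo, tags.melody]
    return ", ".join(p for p in parts if p)


# Tags an LLM might derive from a video description (illustrative values only).
video_tags = MusicTags(instruments=["piano"], genres=["ambient"], tempo="slow")
# The user overrides the instrument and tempo but keeps the suggested genre.
user_tags = MusicTags(instruments=["acoustic guitar"], tempo="fast")

prompt = build_prompt(merge_tags(video_tags, user_tags))
print(prompt)  # acoustic guitar, ambient, fast
```

In a full pipeline, a prompt like this would presumably be embedded by a text encoder such as CLAP and passed to the diffusion model as conditioning. That step is omitted here because it depends on model weights and APIs not described in this summary.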
