SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data

Liqian Zhang; Magdalena Fuentes

TL;DR

SONIQUE addresses video-to-music generation without relying on paired audio-visual data by uniting Video-LLaMA-based video understanding, LLM-driven tag generation, and a CLAP-conditioned diffusion U-Net trained on royalty-free music. The system converts video semantics into concise musical prompts via tagging, enabling user control over instruments, genres, tempo, and melodies. Trained on about 2,644 hours of music with LP-MusicCaps tagging and evaluated with both objective metrics (FD_openl3, KL_passt, CLAP score) and a human study, SONIQUE achieves competitive results and demonstrates clear semantic alignment, though timing precision for longer clips remains challenging. The work offers an open-source, scalable path for customizable, royalty-friendly video background music generation using unpaired data.

Abstract

We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.

Paper Structure (5 sections, 5 figures)

This paper contains 5 sections, 5 figures.

Introduction
Related Work
SONIQUE
Evaluation
Conclusion

Figures (5)

Figure 1: Our proposed SONIQUE architecture. Semantic information is extracted from the input video using Video-LLaMA [zhang2023videollama] and passed through an LLM to generate simpler, descriptive tags. CLAP [elizalde2022clap] then encodes these tags, together with any user-provided tags for customization, into prompt features. Finally, a diffusion U-Net [evans2024fast] uses these features to generate background music.
Figure 2: The process of generating music with SONIQUE. First, the input video (e.g., a cartoon scene from Zootopia) is analyzed to generate descriptive tags such as "Cyberpunk, Electronic, Futuristic, Synthwave, Upbeat, 120 BPM." Users can then fine-tune the music generation by providing additional prompts or specifying negative prompts. The final output is background music that matches both the video and user preferences.
Figure 3: Tag generation for training in SONIQUE. Raw music is fed into LP-MusicCaps [doh2023lpmusiccaps] to generate initial captions. Qwen 14B [bai2023qwen] then processes these captions in two steps: it first converts the captions into tags, then cleans the data by removing incorrect or misleading tags (e.g., "Low Quality"). The result is a clean set of tags for training.
Figure 4: Quantitative results on MusicCaps. Results for other models are taken from [evans2024fast].
Figure 5: Human evaluation overall score
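
The tag handling described in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `MusicPrompt` container, and the blocklist contents are hypothetical, and the Video-LLaMA, Qwen, CLAP, and diffusion stages are not modeled. It only shows how misleading tags might be filtered out and how video-derived tags, user tags, and negative tags could be merged into text prompts.

```python
from dataclasses import dataclass

# Hypothetical blocklist: the Figure 3 caption names only "Low Quality" as an example.
MISLEADING_TAGS = frozenset({"low quality"})


def clean_tags(raw_tags, blocklist=MISLEADING_TAGS):
    """Drop blank, duplicate (case-insensitive), and blocklisted tags, keeping order."""
    seen = set()
    cleaned = []
    for tag in raw_tags:
        tag = tag.strip()
        key = tag.lower()
        if not key or key in blocklist or key in seen:
            continue
        seen.add(key)
        cleaned.append(tag)
    return cleaned


@dataclass
class MusicPrompt:
    """Text conditioning that a downstream text encoder would consume (assumed shape)."""
    positive: str
    negative: str = ""


def build_prompt(video_tags, user_tags=(), negative_tags=()):
    """Merge video-derived and user tags; keep negative tags as a separate prompt."""
    positive = clean_tags(list(video_tags) + list(user_tags))
    # Negative tags are only deduplicated: a user may legitimately want to
    # steer away from something like "Low Quality".
    negative = clean_tags(negative_tags, blocklist=frozenset())
    return MusicPrompt(positive=", ".join(positive), negative=", ".join(negative))


if __name__ == "__main__":
    video_tags = ["Cyberpunk", "Electronic", "Futuristic", "Synthwave", "Upbeat", "120 BPM"]
    prompt = build_prompt(video_tags, user_tags=["Piano"], negative_tags=["Vocals"])
    print(prompt.positive)
    print(prompt.negative)
```

Keeping the negative prompt as a separate string mirrors the caption's distinction between additional prompts and negative prompts; how each is actually consumed by the model is outside the scope of this sketch.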
