Harmonizing Pixels and Melodies: Maestro-Guided Film Score Generation and Composition Style Transfer

F. Qi, L. Ni, C. Xu

TL;DR

A film score generation framework to harmonize visual pixels and music melodies utilizing a latent diffusion model, capable of generating music that reflects the guidance of a maestro's style, thereby redefining the benchmark for automated film scores and laying a robust groundwork for future research in this domain.

Abstract

We introduce a film score generation framework that harmonizes visual pixels and music melodies using a latent diffusion model. The framework takes film clips as input and generates music that aligns with a general theme, with the option to tailor the output to a specific composition style. The model produces music directly from video through a streamlined, efficient tuning mechanism built on ControlNet. It also integrates a film encoder that captures the film's semantic depth, emotional impact, and aesthetic appeal. Additionally, we introduce a novel, simple yet effective metric for assessing the originality and recognizability of music within film scores. We also curate a comprehensive dataset of film videos and legendary original scores to fill the data gap for film scores, injecting domain-specific knowledge into our data-driven generation model. Our model outperforms existing methods at creating film scores and can generate music that reflects the guidance of a maestro's style, thereby redefining the benchmark for automated film scores and laying a robust groundwork for future research in this domain. The code and generated samples are available at https://anonymous.4open.science/r/HPM.

Paper Structure

This paper contains 36 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Illustration of our HPM framework. a) During the training stage, our model takes the video feature as a global control input, alongside the local control signals of melody and dynamics. b) During the inference stage, the local controls can be taken from the composition style of a specific master, guiding the Film Score ControlNet to produce a Mel-spectrogram, which is then converted into audio by a vocoder. c) The film encoder extracts emotional, semantic, and aesthetic embeddings, enriching the model's interpretative depth.
  • Figure 2: Originality vs. Recognizability Comparison. Each large data point represents a single model and shows its average originality and recognizability scores across all categories; the surrounding smaller data points show that model's scores within each category.
  • Figure 3: Examples of our film score generation, including video frames (top), emotion and aesthetic scores (middle), and generated melody and dynamics (bottom).
  • Figure 4: Examples of our composition style transfer, including spectrogram (top), melody (middle), and dynamics (bottom).
  • Figure 5: Data distribution of the top 20 composers.
  • ...and 5 more figures