FilmComposer: LLM-Driven Music Production for Silent Film Clips

Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li

TL;DR

This work addresses the challenge of producing high-quality, cinematically coherent music for silent film clips by introducing FilmComposer, an LLM-driven pipeline that imitates the professional music workflow (spotting, composition, arrangement, and mixing). It combines waveform and symbolic generation through three modules (visual processing, rhythm-controllable MusicGen, and a multi-agent assessment, arrangement, and mix stage) to improve audio quality, musicality, and musical development while keeping the user in control. A dedicated dataset, MusicPro-7k, comprising 7,418 film clips with music, descriptions, rhythm spots, and main melodies, underpins training and evaluation, complemented by novel metrics for musicality, development, and audiovisual alignment. Empirical results show state-of-the-art performance across quality, video correspondence, diversity, and musical development, with interactivity that supports integration into real production pipelines and education.

Abstract

In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development) and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement, and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, considering the lack of a professional, high-quality film music dataset, we propose MusicPro-7k, which includes 7,418 film clips with music, descriptions, rhythm spots, and main melodies. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: https://apple-jun.github.io/FilmComposer.github.io/
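The three-module structure and the per-step user intervention described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the `Draft` fields, the stage function names, and the `review` callback are all hypothetical, and each stage body is a placeholder rather than a real video or music model. It only illustrates stages run in sequence with an optional user hook after each one.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Draft:
    """State passed between stages; every field name here is hypothetical."""
    clip_path: str
    description: str = ""
    rhythm_spots: Tuple[float, ...] = ()
    melody: str = ""
    mix: str = ""
    history: Tuple[str, ...] = ()


def visual_processing(d: Draft) -> Draft:
    # Placeholder: a real module would analyse the frames of the clip.
    return replace(d, description="placeholder description",
                   rhythm_spots=(0.0, 2.5),
                   history=d.history + ("visual_processing",))


def rhythm_controllable_generation(d: Draft) -> Draft:
    # Placeholder for the rhythm-conditioned music generator.
    return replace(d, melody=f"melody hitting {len(d.rhythm_spots)} spots",
                   history=d.history + ("rhythm_controllable_generation",))


def multi_agent_arrange_and_mix(d: Draft) -> Draft:
    # Placeholder for the assessment, arrangement, and mix agents.
    return replace(d, mix=f"arranged and mixed: {d.melody}",
                   history=d.history + ("multi_agent_arrange_and_mix",))


STAGES = (visual_processing,
          rhythm_controllable_generation,
          multi_agent_arrange_and_mix)

Review = Callable[[str, Draft], Draft]


def run_pipeline(clip_path: str, review: Optional[Review] = None) -> Draft:
    """Run the stages in order; `review` may edit the draft after each stage."""
    draft = Draft(clip_path=clip_path)
    for stage in STAGES:
        draft = stage(draft)
        if review is not None:
            draft = review(stage.__name__, draft)
    return draft
```

Passing a `review` function that edits, for example, `rhythm_spots` after the first stage changes what the later stages receive, which is one simple way to model intervention at every step.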

Paper Structure

This paper contains 29 sections, 8 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: A schematic of our work. The left column illustrates the actual steps taken by human musicians in music production, while the middle column represents the corresponding simulated blocks in FilmComposer. The inputs and outputs at each stage are depicted on the right, visually demonstrating how film clips are gradually transformed into the final music.
  • Figure 2: The framework of FilmComposer. Three large color blocks represent the three main modules, through which the input film clips pass sequentially, ultimately outputting a waveform. The three blue blocks with musical notation illustrate the complete music production process, from setting the rhythm points and composing to arranging and mixing.
  • Figure 3: The output scheme from the multi-agent arrangement and mix system, together with subsequent operation.
  • Figure 4: The structure and construction method of MusicPro-7k, which consists of film clips, description, music and rhythm spots.
  • Figure 5: Qualitative comparison results on spectrograms. The yellow box indicates where the music generated by Video2Music, M2UGen, and VidMuse exhibits incoherence. The blue boxes show that CMT, Video2Music, and M2UGen are generally monotone, while VidMuse presents abrupt shifts. In contrast, the ground truth and FilmComposer demonstrate clear layering.
  • ...and 5 more figures