Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Liu He, Yizhi Song, Hejun Huang, Pinxin Liu, Yunlong Tang, Daniel Aliaga, Xin Zhou

TL;DR

Kubrick is a multimodal agent framework for synthetic video generation that combines LLM/VLM collaboration with Blender-based scripting to produce physically plausible, storyboard-driven videos. The pipeline comprises three intertwined phases (Collaborative Generating, Visual Reviewing, and Library Updating), augmented by Retrieval-Augmented Generation to ground the agents in domain knowledge. Empirical results show that Kubrick outperforms seven baselines across FVD, FID, CLIP-SIM, and user-study metrics, with improved motion realism, camera control, and prompt fidelity. The approach supports scalable, AI-assisted video production by bridging high-level narratives with detailed 3D engine scripting and iterative feedback loops.

Abstract

Text-to-video generation has been dominated by diffusion-based or autoregressive models. These novel models provide plausible versatility, but are criticized for improper physical motion, shading and illumination, camera motion, and temporal consistency. The film industry relies on manually edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos address these shortcomings, but require tight collaboration between movie makers and 3D rendering experts. We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a language description of a video, multiple VLM agents direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video following the given description. Augmented with Blender-based movie-making knowledge, the Director agent decomposes the text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts based on function composing and API calling. The Reviewer agent, with knowledge of video reviewing, character motion coordinates, and intermediate screenshots, provides feedback to the Programmer agent. The Programmer agent iteratively improves scripts to yield the best video outcome. Our generated videos show better quality than commercial video generation models in five metrics on video quality and instruction-following performance. Our framework outperforms other approaches in a user study on quality, consistency, and rationality.
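
The control flow described in the abstract (a Director decomposing the description, a Programmer writing a script per sub-process, and a Reviewer feeding comments back until the result is acceptable) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_pipeline`, `Review`, the `toy_*` stand-ins, the two-step decomposition, and the `max_rounds` cap are all hypothetical, and plain Python functions take the place of the LLM/VLM agents and the Blender renderer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    approved: bool   # did the reviewer accept the visual output?
    feedback: str    # free-text comments passed back to the programmer


def run_pipeline(
    description: str,
    director: Callable[[str], List[str]],
    programmer: Callable[[str, str], str],
    reviewer: Callable[[str, str], Review],
    render: Callable[[str], str],
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> List[str]:
    """Return one visual output per sub-process description."""
    outputs = []
    for sub_description in director(description):
        # First draft is written without any reviewer feedback.
        script = programmer(sub_description, "")
        visual = render(script)
        for _ in range(max_rounds):
            review = reviewer(sub_description, visual)
            if review.approved:
                break
            # Revise the script using the reviewer's comments and re-render.
            script = programmer(sub_description, review.feedback)
            visual = render(script)
        # If the cap is hit, the last revision is kept without a final review.
        outputs.append(visual)
    return outputs


# Hypothetical stand-ins so the sketch runs without any model or Blender.
def toy_director(description: str) -> List[str]:
    return [f"{description} :: step {i}" for i in (1, 2)]


def toy_programmer(sub_description: str, feedback: str) -> str:
    suffix = f" (revised: {feedback})" if feedback else ""
    return f"# script for {sub_description}{suffix}"


def toy_render(script: str) -> str:
    return f"render[{script}]"


def toy_reviewer(sub_description: str, visual: str) -> Review:
    ok = "revised" in visual
    return Review(approved=ok, feedback="" if ok else "slow the camera")


outputs = run_pipeline(
    "a robot walks", toy_director, toy_programmer, toy_reviewer, toy_render
)
```

In this toy run each sub-process is rejected once and accepted after a single revision, so every entry in `outputs` comes from a revised script. A real system would replace the stand-ins with model calls and an actual Blender render step.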

Paper Structure

This paper contains 33 sections, 5 equations, 16 figures, 2 tables, and 1 algorithm.

Figures (16)

  • Figure 1: More Generated Video Clips. The corresponding text prompts are provided below each video clip.
  • Figure 2: Kubrick Framework. Our video generation pipeline consists of multiple agent collaborations and iterative refinement of both generated visual results and code libraries. The LLM-Director decomposes the user-provided video description into functional sub-processes. Then, for each sub-process, LLM-Programmer uses a set of function libraries to compose Python scripts for Blender. Intermediate screenshots and video outputs are reviewed by VLM-Reviewer for visual quality and prompt-following evaluation. The review is fed back to LLM-Programmer to iteratively improve the video. All three agents are improved by RAG based on public video tutorials and online documents.
  • Figure 2: More Qualitative Comparisons. Our method excels in handling character/object motions (highlighted in blue) and complex camera moves (highlighted in orange).
  • Figure 3: Collaborative Generation. The video description $q$ is decomposed by LLM-Director into five sub-processes of Mise-en-scène. Each sub-process handles specific tasks during synthetic video generation. Each sub-process description $\gamma_{i}$ guides scripting by LLM-Programmer to render the corresponding intermediate visual output $v_{i}$.
  • Figure 3: Video-to-Video Style Transfer. We apply AnimateDiff [guo2023animatediff] and ControlNet [zhang2023adding] (conditioned on both Canny edges and depth) to each key frame output by our method. The converted video (lower row) inherits consistent character motion and smooth camera motion from the video generated by our method (upper row). This shows the potential of using our generated videos as a prior for diversely styled video generation. Note that the color inconsistency across key frames results from the randomness in the diffusion model.
  • ...and 11 more figures
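
The Figure 2 caption says that LLM-Programmer composes Python scripts for Blender from a set of function libraries. One way to picture that kind of function composition is the sketch below, which assembles script text from a tiny library of snippet generators. Everything here is an assumption made for illustration: `LIBRARY`, `add_cube`, `keyframe_location`, `compose_script`, and the example plan are hypothetical and are not taken from the paper's actual library. The emitted lines use standard `bpy` calls, but the sketch only builds and syntax-checks the text; running the generated script requires Blender.

```python
from typing import Any, Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def add_cube(name: str, location: Vec3) -> List[str]:
    # Emit lines that add a cube and rename the newly active object.
    return [
        f"bpy.ops.mesh.primitive_cube_add(location={location!r})",
        f"bpy.context.active_object.name = {name!r}",
    ]


def keyframe_location(name: str, frame: int, location: Vec3) -> List[str]:
    # Emit lines that move an object and record a location keyframe.
    return [
        f"obj = bpy.data.objects[{name!r}]",
        f"obj.location = {location!r}",
        f"obj.keyframe_insert(data_path='location', frame={frame})",
    ]


# Hypothetical function library: name -> generator of script lines.
LIBRARY: Dict[str, Callable[..., List[str]]] = {
    "add_cube": add_cube,
    "keyframe_location": keyframe_location,
}


def compose_script(calls: List[Tuple[str, Dict[str, Any]]]) -> str:
    """Turn an ordered list of (function name, kwargs) into script text."""
    lines = ["import bpy"]
    for func_name, kwargs in calls:
        lines.extend(LIBRARY[func_name](**kwargs))
    return "\n".join(lines) + "\n"


# Example plan: a box that slides along the x-axis between frames 1 and 48.
plan = [
    ("add_cube", {"name": "Box", "location": (0.0, 0.0, 0.0)}),
    ("keyframe_location", {"name": "Box", "frame": 1, "location": (0.0, 0.0, 0.0)}),
    ("keyframe_location", {"name": "Box", "frame": 48, "location": (4.0, 0.0, 0.0)}),
]
script = compose_script(plan)

# Syntax check only; this does not import bpy or execute the script.
compile(script, "<generated>", "exec")
```

Keeping the library as named generators means a planning agent only has to choose function names and arguments, which is one plausible reading of "function composing and API calling"; the paper's own library may be organized differently.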