Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
Liu He, Yizhi Song, Hejun Huang, Pinxin Liu, Yunlong Tang, Daniel Aliaga, Xin Zhou
TL;DR
Kubrick is a multimodal agent framework for synthetic video generation that combines LLM/VLM collaboration with Blender-based scripting to produce physically plausible, storyboard-driven videos. The pipeline comprises three intertwined phases (Collaborative Generating, Visual Reviewing, and Library Updating), augmented by Retrieval-Augmented Generation to ground the agents in domain knowledge. Empirical results show Kubrick outperforms seven baselines across FVD, FID, CLIP-SIM, and user-study metrics, with improved motion realism, camera control, and prompt fidelity. The approach supports scalable, AI-assisted video production by bridging high-level narratives with detailed 3D engine scripting and iterative feedback loops.
Abstract
Text-to-video generation has been dominated by diffusion-based or autoregressive models. These models are versatile, but are criticized for improper physical motion, shading and illumination, camera motion, and temporal consistency. The film industry instead relies on manually edited Computer-Generated Imagery (CGI) produced with 3D modeling software. Human-directed 3D synthetic videos address these shortcomings, but require tight collaboration between movie makers and 3D rendering experts. We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a language description of a video, multiple VLM agents direct the stages of the generation pipeline and cooperate to create Blender scripts that render a video following the description. Augmented with Blender-based movie-making knowledge, the Director agent decomposes the text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts through function composition and API calls. The Reviewer agent, drawing on knowledge of video reviewing, character motion coordinates, and intermediate screenshots, provides feedback to the Programmer agent, which iteratively improves the scripts to yield the best video outcome. Our generated videos show better quality than commercial video generation models on five metrics covering video quality and instruction-following performance. Our framework also outperforms other approaches in a user study on quality, consistency, and rationality.
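The Director, Programmer, and Reviewer roles described above can be pictured as a simple control loop. The following is a minimal sketch of that loop only, not the authors' implementation: the function names, the `Review` type, the `max_rounds` limit, and the toy agents are all hypothetical stand-ins for VLM calls, and no Blender API is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    """Hypothetical reviewer verdict: approve the script or return feedback."""
    approved: bool
    feedback: str


def generate_scripts(
    description: str,
    director: Callable[[str], List[str]],
    programmer: Callable[[str, str], str],
    reviewer: Callable[[str, str], Review],
    max_rounds: int = 3,  # assumed cap on review iterations
) -> List[str]:
    """Decompose a description, then draft and revise one script per sub-process."""
    scripts = []
    for sub_process in director(description):
        script = programmer(sub_process, "")  # first draft, no feedback yet
        for _ in range(max_rounds):
            review = reviewer(sub_process, script)
            if review.approved:
                break
            # Feed the reviewer's comments back to the programmer.
            script = programmer(sub_process, review.feedback)
        scripts.append(script)
    return scripts


# Toy agents standing in for VLM calls (assumptions for illustration only).
def toy_director(description: str) -> List[str]:
    return [part.strip() for part in description.split(",") if part.strip()]


def toy_programmer(sub_process: str, feedback: str) -> str:
    suffix = "  # revised" if feedback else ""
    return f"# script for: {sub_process}{suffix}"


def toy_reviewer(sub_process: str, script: str) -> Review:
    if script.endswith("# revised"):
        return Review(approved=True, feedback="")
    return Review(approved=False, feedback="add camera motion")


if __name__ == "__main__":
    for s in generate_scripts(
        "a ball rolls, camera pans", toy_director, toy_programmer, toy_reviewer
    ):
        print(s)
```

In this sketch each sub-process is revised independently until the reviewer approves it or the round limit is reached; a real system would replace the toy agents with model calls and render intermediate screenshots for the reviewer.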
