Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making

Shufeng Nan, Mengtian Li, Sixiao Zheng, Yuwei Lu, Han Zhang, Yanwei Fu

Abstract

We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within a game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.

Paper Structure (17 sections, 5 equations, 4 figures, 1 table, 2 algorithms)

Figures (4)

  • Figure 1: Traditional Previz vs. Mind-of-Director. Traditional Previz requires iterative collaboration across multiple departments (typically $N \gg 1$ iterations), involving script writing, 2D storyboarding, 3D scene construction, character blocking, animatic production, and camera planning. In contrast, Mind-of-Director automates this process ($N=1$) through multi-modal agents that collaborate in real-time decision-making to generate high-quality, semantically aligned, and visually coherent previz sequences directly from an idea, enabling a single creator to prototype cinematic scenes in the game engine with minimal manual effort.
  • Figure 2: Overview of the Mind-of-Director Framework. Given a high-level idea, our multi-modal agent-driven framework simulates a structured collaborative decision-making workflow through four interconnected modules: (1) Script Development refines the screenplay via a Discuss-Revise-Judge process; (2) Virtual Scene Design builds consistent 3D environments using 2D-guided and rule-based generation under spatial constraints; (3) Character Behaviour Control optimizes character blocking and motion through agent feedback; (4) Camera Planning selects and validates cinematic shots via a Debate-Judge-Validation loop for physical plausibility. All modules are integrated in Unity for real-time visualization and iterative refinement.
  • Figure 3: Qualitative Comparison. We present a representative sample from Act $A_i$ to demonstrate our framework's performance and cross-stage consistency. The figure shows results across four stages: (a) Script Development: screenplays generated by Solo vs. Agent Collaboration; (b) Virtual Scene Design: scene layouts from StageDesigner vs. our approach, which improves spatial grounding; (c) Character Behaviour Control: character positioning from FilmAgent vs. our agent-driven method; (d) Camera Planning: camera shot selection from FilmAgent vs. our approach.
  • Figure 4: Unity-Based Interface. Our system provides synchronized timeline tracks for characters and cameras, enabling real-time inspection, editing, and visualization across all stages.
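
The Discuss-Revise-Judge process named in the Figure 2 caption can be pictured as a bounded refinement loop. The following is a minimal sketch only, not the authors' implementation: the function `discuss_revise_judge`, the `LoopResult` type, the `max_rounds` cap, and the plain-callable agent interfaces are all assumptions made for illustration, and the real agents are presumably LLM-backed rather than simple functions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical agent interfaces (assumed for this sketch, not from the paper).
Critic = Callable[[str], str]               # draft -> critique text
Reviser = Callable[[str, List[str]], str]   # draft + critiques -> new draft
Judge = Callable[[str], bool]               # draft -> accept or reject


@dataclass
class LoopResult:
    draft: str       # final draft text
    rounds: int      # number of rounds actually run
    accepted: bool   # whether the judge accepted the final draft


def discuss_revise_judge(
    draft: str,
    critics: List[Critic],
    revise: Reviser,
    judge: Judge,
    max_rounds: int = 3,  # assumed cap so the loop always terminates
) -> LoopResult:
    """Refine a draft until the judge accepts it or the round cap is hit."""
    for round_idx in range(1, max_rounds + 1):
        critiques = [critic(draft) for critic in critics]  # Discuss
        draft = revise(draft, critiques)                   # Revise
        if judge(draft):                                   # Judge
            return LoopResult(draft, round_idx, True)
    return LoopResult(draft, max_rounds, False)


if __name__ == "__main__":
    # Toy stand-ins for the agents: each revision appends "!" and the
    # judge accepts once the draft ends in "!!".
    result = discuss_revise_judge(
        "x",
        critics=[lambda d: "needs more detail"],
        revise=lambda d, notes: d + "!",
        judge=lambda d: d.endswith("!!"),
        max_rounds=5,
    )
    print(result)
```

The same loop shape would plausibly cover the Debate-Judge-Validation step in Camera Planning if the judge were followed by a separate physical-plausibility check, though the caption does not specify how that validation is performed.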