AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, Wangmeng Zuo, Nan Duan

TL;DR

AutoDirector presents an interactive, GPT-4–based director agent that coordinates parallel scheduling of long multi-sensory film production tasks (scriptwriting, shooting, scoring, dubbing, and special effects) and continuously adapts to user feedback. By modeling production as events with dependencies and time-sliced progress reports, it computes planned and revoked tasks via $Q_t$ and $R_t$ and updates $p_{t+1}$ with $p_{t+1} = \phi(p_t; R_t, Q_t)$ in a looping cycle, enabling dynamic replanning. The system integrates emotion-aware dubbing, diffusion-based video synthesis, theme-aware music, and editing tools to produce cohesive outputs, demonstrated on a 1m18s case study, The General's Wedding. Experimental results show AutoDirector outperforms baselines in visual aesthetics, narrativity, and controllability, while achieving around a 40% efficiency gain through parallel scheduling and iterative user interaction. The work highlights the practical potential of AI-assisted directing for high-value, multi-sensory media production, balanced against substantial computational requirements and the need for resource-aware deployment.
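The update rule $p_{t+1} = \phi(p_t; R_t, Q_t)$ can be made concrete with a small sketch. The summary does not define $\phi$ or the exact meaning of $p_t$, so the code below makes two loud assumptions: $p_t$ is treated as the set of currently scheduled tasks, and $\phi$ removes the revoked tasks $R_t$ before adding the planned tasks $Q_t$. The function `toy_decide` is a hypothetical stand-in for the director agent, not the paper's GPT-4 prompt logic.

```python
def phi(p_t, R_t, Q_t):
    """Assumed update rule: drop revoked tasks, then add newly planned ones."""
    return (set(p_t) - set(R_t)) | set(Q_t)


def run_cycle(p_0, feedback_per_slice, decide):
    """Apply phi once per time slice and record each intermediate state."""
    p_t = set(p_0)
    history = [p_t]
    for feedback in feedback_per_slice:
        # The director inspects the current state and the user's feedback,
        # then returns the revoked set R_t and the planned set Q_t.
        R_t, Q_t = decide(p_t, feedback)
        p_t = phi(p_t, R_t, Q_t)
        history.append(p_t)
    return history


def toy_decide(p_t, feedback):
    """Hypothetical director: redo the dialogue when the user asks for it."""
    if feedback == "make the dialogue warmer" and "dialogue_v1" in p_t:
        return {"dialogue_v1"}, {"dialogue_v2"}
    return set(), set()  # nothing to revoke or plan: wait


history = run_cycle(
    {"script", "dialogue_v1"},
    [None, "make the dialogue warmer"],
    toy_decide,
)
print(sorted(history[-1]))  # ['dialogue_v2', 'script']
```

In the first slice there is no feedback, so the state is unchanged. In the second slice the user comment causes one task to be revoked and a replacement to be planned, which is the replanning behaviour the loop is meant to capture.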

Abstract

With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism. However, approaches to generating multi-sensory outputs have not been fully explored, limiting their application to high-value scenarios such as directing a film. Developing a movie director agent faces two major challenges: (1) Lack of parallelism and online scheduling across production steps: In the production of multi-sensory films, there are complex dependencies between different sensory elements, and the production time for each element varies. (2) Diverse needs and the demand for clear communication with users: Users often cannot clearly express their needs until they see a draft, which requires human-computer interaction and iteration to continually adjust and optimize the film content based on user feedback. To address these issues, we introduce AutoDirector, an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing. This framework improves the efficiency of multi-sensory film production through automatic scheduling and supports the modification and improvement of interactive tasks to meet user needs. AutoDirector not only expands the application scope of human-machine collaboration but also demonstrates the potential of AI to collaborate with humans in the role of a film director to complete multi-sensory films.

Paper Structure (29 sections, 5 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Multi-sensory movie generation results of AutoDirector. The multi-sensory composition process includes scriptwriting, shooting, scoring, dubbing, and special effects. AutoDirector effectively integrates these elements to produce high-quality movies.
  • Figure 2: Overview of cognitive scheduling. Unlike traditional sequential execution, which is inefficient and cannot incorporate user needs, our method continuously organizes and arranges tasks based on user comments and carries out movie creation efficiently through parallel execution.
  • Figure 3: The film production process of AutoDirector. The process can be interpreted from two perspectives: the task perspective and the time perspective. From the task perspective, film production consists of a series of tasks with ordering dependencies between them. AutoDirector manages all the tasks, and the user continuously puts forward requirements during the process. From the time perspective, AutoDirector receives the current progress report and user feedback at the beginning of each time segment and, based on these, arranges new tasks, revokes completed tasks, or waits, until production is complete.
  • Figure 4: Comparison of different interaction types. As user feedback becomes richer, from Yes/No answers to Detailed Comments, the picture becomes more expressive; a higher degree of participation leads to a more nuanced and emotionally impactful final product.
  • Figure 5: Movie production process involving a user comment. The process is dynamic: decisions are made based on task progress and user feedback, leading to adjustments and improvements in tasks such as dialogue generation.
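
The time perspective described for Figure 3 can be illustrated with a toy time-sliced scheduler. This is a minimal sketch, not the paper's implementation: the task names follow the production steps listed for Figure 1, but the durations, the dependency graph, and the assumption of unlimited parallel workers are all invented for illustration. At the start of each slice the scheduler starts every task whose dependencies are complete, then advances one slice.

```python
# Hypothetical tasks: name -> (duration in time slices, dependencies).
TASKS = {
    "script":  (2, []),
    "shoot":   (4, ["script"]),
    "score":   (3, ["script"]),
    "dub":     (2, ["script"]),
    "effects": (2, ["shoot"]),
    "edit":    (1, ["shoot", "score", "dub", "effects"]),
}


def run_time_sliced(tasks):
    """Return the number of slices needed when ready tasks run in parallel.

    Assumes an acyclic dependency graph and positive integer durations.
    """
    remaining = {name: duration for name, (duration, _) in tasks.items()}
    done, running, t = set(), set(), 0
    while len(done) < len(tasks):
        # Start of the slice: start every task whose dependencies are done.
        for name, (_, deps) in tasks.items():
            if name not in done and name not in running:
                if all(dep in done for dep in deps):
                    running.add(name)
        # Advance one slice and collect the tasks that finish in it.
        t += 1
        for name in list(running):
            remaining[name] -= 1
            if remaining[name] == 0:
                running.remove(name)
                done.add(name)
    return t


sequential = sum(duration for duration, _ in TASKS.values())
parallel = run_time_sliced(TASKS)
print(sequential, parallel)  # 14 9
```

With these invented numbers, running shooting, scoring, and dubbing concurrently after the script is ready shortens the schedule from 14 slices to 9. The figures say nothing about the paper's measured efficiency gain; they only show why dependency-aware parallel execution can beat sequential execution.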