SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, Dacheng Tao

TL;DR

SPAgent presents a semantic planning framework that automatically coordinates a diverse tool library of open-source video generation and editing models to satisfy varied user intents. It decomposes the problem into decoupled intent recognition, principle-guided route planning, and capability-based model selection, and adds a video quality evaluation module to autonomously expand its toolkit. A manually annotated multi-task video dataset supports supervised fine-tuning. Experimental results show SPAgent achieves higher video quality, full task completion across text-to-video and image-to-video tasks, and effective autonomous library expansion, enabling adaptable general video generation and editing with reduced user burden.

Abstract

While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks and fail to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming: users must deeply understand task requirements and possess comprehensive knowledge of each model's performance, applicability, and limitations, which raises the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, SPAgent assembles a tool library that integrates state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing through a novel three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance SPAgent's video quality evaluation capability, enabling it to autonomously assess new video generation and editing models and incorporate them into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.
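
The three-step framework named in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the tool names, task labels, capability scores, and candidate routes are placeholders, and the rule-based functions stand in for decisions that, per the abstract, are made by the fine-tuned agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str     # hypothetical tool name, not a real model
    task: str     # task label this tool can perform
    score: float  # placeholder capability score (assumed, not from the paper)


# Hypothetical tool library; entries are placeholders for open-source models.
TOOL_LIBRARY = [
    Tool("t2v_model_a", "text-to-video", 0.70),
    Tool("t2v_model_b", "text-to-video", 0.82),
    Tool("t2i_model_a", "text-to-image", 0.75),
    Tool("i2v_model_a", "image-to-video", 0.78),
    Tool("v2v_model_a", "video-editing", 0.66),
]

# Assumed candidate routes per intent; each route is a sequence of task labels.
ROUTES = {
    "text-to-video": [["text-to-video"], ["text-to-image", "image-to-video"]],
    "image-to-video": [["image-to-video"]],
    "video-editing": [["video-editing"]],
}


def recognize_intent(has_image: bool, has_video: bool) -> str:
    """Step 1 stand-in: map the input modalities to an intent label."""
    if has_video:
        return "video-editing"
    if has_image:
        return "image-to-video"
    return "text-to-video"


def plan_routes(intent: str) -> list[list[str]]:
    """Step 2 stand-in: look up the candidate routes for an intent."""
    return ROUTES[intent]


def select_models(route: list[str], library: list[Tool]) -> list[str]:
    """Step 3 stand-in: pick the highest-scoring tool for each step of a route."""
    chosen = []
    for task in route:
        candidates = [tool for tool in library if tool.task == task]
        best = max(candidates, key=lambda tool: tool.score)
        chosen.append(best.name)
    return chosen


def plan(has_image: bool = False, has_video: bool = False) -> list[list[str]]:
    """Run the three steps and return one tool chain per candidate route."""
    intent = recognize_intent(has_image, has_video)
    return [select_models(route, TOOL_LIBRARY) for route in plan_routes(intent)]


if __name__ == "__main__":
    print(plan())                # text-only request
    print(plan(has_image=True))  # request with a reference image
```

Keeping the three steps as separate functions mirrors the decoupling the abstract describes: registering a new tool only changes the library that step 3 searches, while the intent and route logic stay untouched.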

Paper Structure

This paper contains 21 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Illustration of our Semantic Planning Agent (SPAgent) for general video generation and editing. The SPAgent system is versatile and can adaptively handle a variety of video generation and editing tasks.
  • Figure 2: An illustration of our SPAgent. The MLLM agent acts as a coordinator. It improves video generation results across multiple scenarios by identifying the user intention, planning execution routes and models, and selecting the final output from all candidates.
  • Figure 3: Data statistics of our dataset in different categories.
  • Figure 4: Comparison of output videos generated by SPAgent with and without integrating CogVideoX into its tool library.
  • Figure 5: Comparison of the output videos generated by different methods given various user input data types and requirements.
  • ...and 9 more figures