Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, Chi Wang, Yanfang Ye, Lichao Sun

TL;DR

Mora presents a novel multi-agent framework for generalist text-to-video generation built on open-source modules, addressing coordination, data quality, and data efficiency through self-modulated fine-tuning, data-free training, and human-in-the-loop data filtering. The approach achieves competitive performance with OpenAI's Sora across six tasks and demonstrates strong capabilities in text-to-video and image-to-video generation, while maintaining an open-source pipeline. Key contributions include a formal multi-agent architecture, a modulation-based coordination mechanism, and a data-free loop augmented by LLM-guided data selection, collectively advancing open research in video synthesis. This work has practical implications for democratizing access to high-quality video generation and enabling flexible, collaborative AI systems for complex multimedia tasks.

Abstract

Text-to-video generation has made significant strides, but replicating the capabilities of advanced systems like OpenAI's Sora remains challenging due to their closed-source nature. Existing open-source methods struggle to achieve comparable performance, often hindered by ineffective agent collaboration and inadequate training data quality. In this paper, we introduce Mora, a novel multi-agent framework that leverages existing open-source modules to replicate Sora's functionalities. We address these limitations by proposing three key techniques: (1) multi-agent fine-tuning with a self-modulation factor to enhance inter-agent coordination, (2) a data-free training strategy that uses large models to synthesize training data, and (3) a human-in-the-loop mechanism combined with multimodal large language models for data filtering to ensure high-quality training datasets. Our comprehensive experiments on six video generation tasks demonstrate that Mora achieves performance comparable to Sora on VBench, outperforming existing open-source methods across various tasks. Specifically, in the text-to-video generation task, Mora achieved a Video Quality score of 0.800, surpassing Sora's 0.797 and outperforming all other baseline models across six key metrics. Additionally, in the image-to-video generation task, Mora achieved a perfect Dynamic Degree score of 1.00, demonstrating exceptional capability in enhancing motion realism, and achieved higher Imaging Quality than Sora. These results highlight the potential of collaborative multi-agent systems and human-in-the-loop mechanisms in advancing text-to-video generation. Our code is available at https://github.com/lichao-sun/Mora.

Paper Structure (35 sections, 3 equations, 15 figures, 7 tables, 1 algorithm)

Figures (15)

  • Figure 1: Illustration of SOPs to conduct video-related tasks in Mora.
  • Figure 2: Samples of text-to-video generation by Mora. Our approach can generate high-resolution, temporally consistent videos from text prompts. The samples shown are at 480p resolution and 12 seconds long, with 276 frames in total.
  • Figure 3: Performance variations of Task 5 and Task 6 across different self-training iterations.
  • Figure 4: Performance variations of Task 1 to Task 4 across different self-training iterations.
  • Figure 5: An example of the image generation process in Mora. Left: agents communicate using structured messages. Right: after the prompt or image is generated, a human user can check the quality of the generated content.
  • ...and 10 more figures