GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

TL;DR

GenMAC tackles the challenge of compositional text-to-video generation by decomposing complex prompts into structured scene layouts and employing an iterative, multi-agent workflow. The Redesign stage is further decomposed into four specialist agents and augmented with an adaptive self-routing mechanism that selects from multiple correction agents (consistency, temporal dynamics, spatial dynamics) to progressively align video output with the prompt. The Design stage provides layout guidance, while the Generation stage uses a layout-conditioned diffusion model to synthesize video, followed by iterative refinements through Redesign. Empirical results on the T2V-CompBench benchmark demonstrate state-of-the-art performance, with notable gains in generative numeracy and spatial-temporal fidelity, validating the benefits of task decomposition and agent specialization for compositional video generation.

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.
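
The control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Plan`, `Verdict`, `run_loop`, `demo`, the stub agents) is hypothetical, the agents are toy functions rather than MLLM calls, and routing is simplified to a lookup on an issue label produced by the verification step. The starting guidance scale and the iteration cap are arbitrary illustrative values.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Plan:
    """What the Redesign stage rewrites between iterations (hypothetical container)."""
    prompt: str
    layouts: List[dict]      # frame-wise layouts; structure is a placeholder
    guidance_scale: float


@dataclass
class Verdict:
    aligned: bool
    issue: Optional[str] = None  # e.g. "consistency", "temporal", "spatial"


def run_loop(
    prompt: str,
    design: Callable[[str], Plan],
    generate: Callable[[Plan], dict],
    verify: Callable[[dict, str], Verdict],
    suggest: Callable[[Verdict], str],
    correctors: Dict[str, Callable[[Plan, str], Plan]],
    structure: Callable[[Plan], Plan],
    max_iters: int = 3,          # assumed cap, not taken from the paper
) -> Tuple[dict, List[str]]:
    plan = design(prompt)                    # Design stage
    video = generate(plan)                   # Generation stage
    routed: List[str] = []
    for _ in range(max_iters):               # Generation <-> Redesign loop
        verdict = verify(video, prompt)      # 1. verification agent
        if verdict.aligned:
            break
        suggestion = suggest(verdict)        # 2. suggestion agent
        corrector = correctors[verdict.issue]  # self-routing (simplified lookup)
        routed.append(verdict.issue)
        plan = structure(corrector(plan, suggestion))  # 3. correction, 4. output structuring
        video = generate(plan)
    return video, routed


def demo() -> Tuple[dict, List[str]]:
    """Toy run: the 'video' is a dict and alignment depends only on guidance scale."""
    def design(p: str) -> Plan:
        return Plan(p, [{"frame": 0, "boxes": []}], 7.5)

    def generate(plan: Plan) -> dict:
        return {"prompt": plan.prompt, "scale": plan.guidance_scale}

    def verify(video: dict, prompt: str) -> Verdict:
        return Verdict(True) if video["scale"] > 8 else Verdict(False, "temporal")

    def suggest(verdict: Verdict) -> str:
        return f"fix {verdict.issue} issue"

    def raise_scale(plan: Plan, suggestion: str) -> Plan:
        return replace(plan, guidance_scale=plan.guidance_scale + 1.0)

    correctors = {"consistency": raise_scale, "temporal": raise_scale, "spatial": raise_scale}
    return run_loop("a dog runs past two red balls", design, generate,
                    verify, suggest, correctors, structure=lambda p: p)


if __name__ == "__main__":
    print(demo())
```

Passing the agents in as callables keeps the loop itself independent of any particular MLLM or video model, which mirrors the role separation the abstract describes.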

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Framework of GenMAC. Collaborative workflow includes three stages with an iterative loop: Design, Generation, and Redesign (\ref{sec:workflow}). Task decomposition decomposes the redesign stage into four sub-tasks, handled by four agents: verification agent, suggestion agent, correction agent, and output structuring agent (\ref{sec:division}). Self-routing mechanism allows for adaptive selection of a suitable correction agent to address the diverse requirements for compositional text-to-video generation (\ref{sec:self-routing}).
  • Figure 2: Illustration of Task Decomposition for the Redesign stage (\ref{sec:division}). The diagram illustrates the allocation of roles: verification agent, suggestion agent, correction agent, and output structuring agent within a sequential task breakdown, highlighting the clear responsibilities of each agent.
  • Figure 3: Qualitative Comparison. Our proposed GenMAC generates videos that accurately adhere to complex compositional scenarios, demonstrating a clear advantage in handling such requirements in comparison with SOTA text-to-video models.
  • Figure 4: Qualitative Results. Our proposed GenMAC generates videos that are highly aligned with complex compositional prompts, including attribute binding, multiple objects, quantity, and dynamic motion binding.
  • Figure 5: Visualization of the iterative refinement process in our multi-agent framework, demonstrating that iterations enhance scene accuracy by progressively aligning video content with compositional prompts.
  • ...and 10 more figures