MOVi: Training-free Text-conditioned Multi-Object Video Generation

Aimon Rahman, Jiang Liu, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Yusheng Su, Vishal M. Patel, Zicheng Liu, Emad Barsoum

TL;DR

MOVi tackles the challenge of generating videos with multiple objects conditioned only on text. It achieves this through a training-free pipeline that uses an LLM as a scene director to plan object trajectories and a noise reinitialization strategy to realize those motions, complemented by attention-based refinements to prevent cross-object interference. The approach yields significant gains in motion dynamics and object accuracy compared with strong baselines and commercial models, while maintaining high visual fidelity. The results demonstrate that open-world priors in diffusion models and LLMs can be effectively harnessed for scalable, flexible multi-object video generation without additional training.

Abstract

Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to capture complex object interactions accurately, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate the multiple distinct objects specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and to prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in a 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.
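
The idea of applying a planned trajectory through noise re-initialization can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the linear interpolation that stands in for LLM-planned trajectories, the fixed box size, and the `keep_fraction` low-pass cutoff are all assumptions made for illustration. The sketch reuses the low-frequency content of the first frame's noise inside the object's box on every frame and keeps each frame's own high-frequency content, so the coarse noise structure follows the trajectory.

```python
import numpy as np


def interpolate_boxes(start_xy, end_xy, size, num_frames):
    """Hypothetical stand-in for an LLM-planned trajectory.

    Linearly moves the top-left corner from start_xy to end_xy and keeps the
    box size fixed. Returns an int array of shape (num_frames, 4) holding
    (x0, y0, x1, y1) per frame.
    """
    width, height = size
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    start = np.asarray(start_xy, dtype=float)
    end = np.asarray(end_xy, dtype=float)
    top_left = np.rint((1.0 - weights) * start + weights * end).astype(int)
    return np.concatenate([top_left, top_left + [width, height]], axis=1)


def low_pass(patch, keep_fraction):
    """Keep only the lowest spatial frequencies of a real 2D patch."""
    spectrum = np.fft.fftshift(np.fft.fft2(patch))
    h, w = patch.shape
    cy, cx = h // 2, w // 2
    ry = max(1, int(h * keep_fraction / 2))
    rx = max(1, int(w * keep_fraction / 2))
    mask = np.zeros((h, w), dtype=bool)
    # Symmetric window around the zero-frequency bin.
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = True
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real


def reinit_noise(noise, boxes, keep_fraction=0.25):
    """Re-initialize noise of shape (T, H, W) along a box trajectory.

    Inside each frame's box, the low-frequency part is replaced by the
    low-frequency part of the first frame's box; the frame's own
    high-frequency part is kept. Pixels outside the boxes are untouched.
    """
    out = noise.copy()
    x0, y0, x1, y1 = boxes[0]
    shared_low = low_pass(noise[0, y0:y1, x0:x1], keep_fraction)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        fresh = noise[t, y0:y1, x0:x1]
        fresh_high = fresh - low_pass(fresh, keep_fraction)
        out[t, y0:y1, x0:x1] = shared_low + fresh_high
    return out
```

For example, `reinit_noise(rng.standard_normal((4, 16, 16)), interpolate_boxes((1, 2), (7, 8), (6, 6), 4))` moves a 6x6 region diagonally over four frames. Because the low-pass and high-pass windows are complementary, the mixed patch keeps roughly the same overall variance as the original noise, which is one reason a frequency split is a plausible choice here.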

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Results of multiple-object video generation: The baseline (chen2024videocrafter2) and other open-source state-of-the-art models (yang2024cogvideox, vchitect, xing2024dynamicrafter) often struggle to generate multiple objects simultaneously. These models frequently prioritize the first object in the prompt or merge multiple objects into a single entity.
  • Figure 2: Pipeline of the proposed MOVi framework. SA stands for self-attention. In the first stage, an LLM acts as a director, generating object trajectories from the input prompt and the specified number of frames. These object trajectories are then used to reinitialize the noise, with masking applied based on the noise's low- and high-frequency components. The noise is passed through the network to generate videos with multiple objects. During the iterative denoising process, attention re-weighting is applied to specific bounding boxes based on the prompt to eliminate unwanted object influences and refine the output.
  • Figure 3: Video frames generated by the prompt "A monkey and a squirrel on a tree." Top row shows the results without attention re-weighting, where features are mixed—e.g., the squirrel has the face of a monkey, and the monkey has a squirrel's tail. Bottom row shows the improved results after applying attention re-weighting, where both the monkey and squirrel retain their correct, distinct features.
  • Figure 4: Multi-object generation quality and accuracy. We report a) FVD and b) object generation accuracy as the number of objects increases and c) varied objects. Please refer to the results section for details.
  • Figure 5: Qualitative comparison between MOVi and state-of-the-art T2V models given the prompt "a cat and a dog playing." Notably, existing methods struggle to generate two distinct objects, often producing either only cats or a combined and indistinct shape of both animals. In contrast, MOVi successfully generates both the cat and the dog as separate and well-defined objects.
  • ...and 4 more figures
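
The attention re-weighting described in the Figure 2 and Figure 3 captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method. It assumes attention scores over prompt tokens with shape (H*W, num_tokens) for a single frame, and the function name, the `boost` and `suppress` values, and the token index lists are all hypothetical. Inside an object's bounding box, the scores of that object's tokens are raised and the scores of other objects' tokens are pushed down before the softmax, so features of one object (a monkey's face, say) are discouraged from leaking into another object's region.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def reweight_attention(logits, box, own_tokens, other_tokens,
                       height, width, boost=2.0, suppress=-1e4):
    """Re-weight pre-softmax attention scores inside one bounding box.

    logits: array of shape (height * width, num_tokens) for one frame.
    box: (x0, y0, x1, y1) in latent-pixel coordinates (assumed layout).
    own_tokens / other_tokens: prompt-token indices of this object and of
    the competing objects (hypothetical inputs for this sketch).
    """
    x0, y0, x1, y1 = box
    inside = np.zeros((height, width), dtype=bool)
    inside[y0:y1, x0:x1] = True
    rows = np.flatnonzero(inside.reshape(-1))

    out = logits.astype(float)  # astype returns a copy by default
    out[np.ix_(rows, own_tokens)] += boost         # favor the box's own object
    out[np.ix_(rows, other_tokens)] += suppress    # mute competing objects
    return softmax(out, axis=-1)
```

Calling the function once per object and per frame, with the boxes taken from the planned trajectories, would give each region its own re-weighted attention map; pixels outside every box keep their original attention distribution.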