MOVi: Training-free Text-conditioned Multi-Object Video Generation
Aimon Rahman, Jiang Liu, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Yusheng Su, Vishal M. Patel, Zicheng Liu, Emad Barsoum
TL;DR
MOVi tackles the challenge of generating videos with multiple objects conditioned only on text. It achieves this through a training-free pipeline that uses an LLM as a scene director to plan object trajectories and a noise reinitialization strategy to realize those motions, complemented by attention-based refinements to prevent cross-object interference. The approach yields significant gains in motion dynamics and object accuracy compared with strong baselines and commercial models, while maintaining high visual fidelity. The results demonstrate that open-world priors in diffusion models and LLMs can be effectively harnessed for scalable, flexible multi-object video generation without additional training.
Abstract
Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to capture complex object interactions accurately, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate the multiple distinct objects specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and to prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in a 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.
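To make the idea of applying trajectories through noise re-initialization more concrete, the following is a minimal sketch under stated assumptions, not the implementation used in the paper. It assumes that each object's trajectory is a per-frame box position (here linearly interpolated, standing in for what an LLM planner might output) and that re-initialization means copying the first-frame noise patch under the box into the box location of every later frame, so the noise is correlated along the planned path. All function names (`make_noise`, `linear_trajectory`, `reinit_noise`) and the single-channel list-of-lists noise layout are hypothetical choices for illustration.

```python
import random


def make_noise(num_frames, height, width, seed=0):
    """Sample i.i.d. Gaussian noise as a [frame][row][col] nested list."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(num_frames)
    ]


def linear_trajectory(start, end, num_frames):
    """Interpolate the box's top-left (row, col) corner linearly over frames.

    Hypothetical stand-in for a planned trajectory; a real planner could
    return arbitrary per-frame boxes.
    """
    if num_frames == 1:
        return [start]
    steps = num_frames - 1
    return [
        (
            round(start[0] + (end[0] - start[0]) * t / steps),
            round(start[1] + (end[1] - start[1]) * t / steps),
        )
        for t in range(num_frames)
    ]


def reinit_noise(noise, trajectory, box_h, box_w):
    """Copy the first-frame noise patch into each frame's box location.

    Assumption: sharing one noise patch along the trajectory encourages the
    denoised object to follow that path. Returns a new nested list and leaves
    the input untouched.
    """
    height, width = len(noise[0]), len(noise[0][0])
    for r, c in trajectory:
        # Every box must lie fully inside the frame.
        assert 0 <= r and r + box_h <= height and 0 <= c and c + box_w <= width
    r0, c0 = trajectory[0]
    patch = [row[c0:c0 + box_w] for row in noise[0][r0:r0 + box_h]]
    out = [[row[:] for row in frame] for frame in noise]
    for frame, (r, c) in zip(out, trajectory):
        for dr in range(box_h):
            frame[r + dr][c:c + box_w] = patch[dr]
    return out


# Usage: move a 2x2 box from column 1 to column 5 over 4 frames of 8x8 noise.
noise = make_noise(num_frames=4, height=8, width=8)
trajectory = linear_trajectory(start=(1, 1), end=(1, 5), num_frames=4)
guided = reinit_noise(noise, trajectory, box_h=2, box_w=2)
```

In this sketch, noise outside the boxes stays independent per frame, so only the planned object regions are tied together over time. With several objects, one would call `reinit_noise` once per object with its own trajectory, again as an assumption about how such a scheme could be composed.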
