Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
Ruibin Li, Tao Yang, Yangming Shi, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
TL;DR
This work tackles the high cost of training dedicated text-to-video (T2V) foundation models by introducing Many-for-Many (MfM), a unified framework that trains a single model to handle diverse generation and manipulation tasks. MfM uses a lightweight adapter to unify 0D/1D/2D/3D conditions across tasks, adds depth maps as an extra condition, and learns jointly from images and videos within a Diffusion Transformer architecture that employs 3D full attention and 3D RoPE. A resolution-progressive, multi-task training regime with Flow Matching and on-the-fly depth prediction enables learning from both image and video data. The resulting 8B model achieves competitive or superior results across T2V, image-to-video (I2V), and several video manipulation tasks while using substantially less data than task-specific or large commercial models. The approach demonstrates strong cross-task knowledge transfer and improved video dynamics, and it makes multi-task video generation more data-efficient. These findings suggest MfM as a versatile, scalable foundation for broad video generation and editing applications with reduced annotation and training costs.
Abstract
Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works finetune a pretrained T2V model for image-to-video (I2V), video-to-video (V2V), and image and video manipulation tasks. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or a few tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source code are available at https://github.com/leeruibin/MfM.git.
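
The TL;DR names Flow Matching as the training objective but does not give its exact form. As a rough illustration only, the sketch below implements the common linear-interpolation variant of flow matching in NumPy. It is not taken from the MfM code base: the function names, the convention that t = 0 is noise and t = 1 is data, the uniform sampling of t, and the (batch, frames, height, width, channels) latent layout are all assumptions made for this example. Under that layout an image can be treated as a one-frame video, which is one simple way to picture joint image-video training.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Build the training input and regression target for linear-path flow matching.

    x0: noise sample, x1: data sample (same shape), t: per-example times in [0, 1].
    Assumed convention: t = 0 is pure noise, t = 1 is clean data.
    """
    # Reshape t so it broadcasts over every non-batch axis.
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path from x0 to x1
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one batch.

    predict_velocity is a hypothetical stand-in for the network: any callable
    taking (x_t, t) and returning an array shaped like x_t.
    """
    x0 = rng.standard_normal(x1.shape)            # Gaussian noise endpoint
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])   # one time per example
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy latents: 2 clips, 3 frames, 4x4 spatial grid, 8 channels.
    video_latents = rng.standard_normal((2, 3, 4, 4, 8))
    # An image batch uses the same code path with a single frame.
    image_latents = rng.standard_normal((2, 1, 4, 4, 8))
    dummy_model = lambda x_t, t: np.zeros_like(x_t)
    print(flow_matching_loss(dummy_model, video_latents, rng))
    print(flow_matching_loss(dummy_model, image_latents, rng))
```

In a real setup the dummy callable would be replaced by the conditioned Diffusion Transformer, and task-specific conditions would be passed alongside `x_t` and `t`; how MfM does that is not specified in this summary.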
