ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Zhefan Rao, Liya Ji, Yazhou Xing, Runtao Liu, Zhaoyang Liu, Jiaxin Xie, Ziqiao Peng, Yingqing He, Qifeng Chen

TL;DR

This paper tackles the high cost and data needs of text-to-video (T2V) training by proposing ModelGrow, a continual general pre-training framework that expands model capacity and enhances language understanding. It introduces transformer expansion by block duplication (insert, prefix, and suffix variants, with zero initialization) to add capacity for new knowledge while mitigating forgetting. It also adds an LLM-enhanced language pathway that conditions generation on two text embeddings, the original T5 embedding and an LLM embedding, with the latter fed through an extra cross-attention block so the model can exploit long prompts. The approach is validated with Open-Sora as the base model, using continual pre-training on a long-prompt video dataset and evaluation on VBench and CompBench; it shows improved quality and semantic alignment, especially with LLM embeddings and detailed re-captioning. The results suggest a scalable route to more capable T2V models under limited resources, and the authors state that the code and models will be made publicly available.
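The three expansion variants and the role of zero initialization can be illustrated with a small sketch. This is not the authors' implementation: the function names (`expanded_layout`, `residual_block`, `zero_weight`), the choice of which original block each new block copies, and the reading of "zero initialization" as zeroing the residual branch's output weights are all assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Layout = List[Tuple[str, int]]


def expanded_layout(num_base: int, num_new: int, mode: str) -> Layout:
    """Return the block order after expansion (hypothetical helper).

    ("base", i) is original block i; ("new", i) is a new block copied
    from original block i. Which block gets copied is an assumption.
    """
    if not 0 < num_new <= num_base:
        raise ValueError("need 0 < num_new <= num_base")
    base = [("base", i) for i in range(num_base)]
    if mode == "prefix":  # all new blocks before the original stack
        return [("new", i) for i in range(num_new)] + base
    if mode == "suffix":  # all new blocks after the original stack
        return base + [("new", num_base - num_new + i) for i in range(num_new)]
    if mode == "insert":  # one new block after each group of original blocks
        if num_base % num_new != 0:
            raise ValueError("num_base must be divisible by num_new")
        group = num_base // num_new
        layout: Layout = []
        for i, block in enumerate(base):
            layout.append(block)
            if (i + 1) % group == 0:
                layout.append(("new", i))
        return layout
    raise ValueError(f"unknown mode: {mode}")


def zero_weight(dim: int) -> List[Vector]:
    """All-zero square matrix standing in for a zero-initialized projection."""
    return [[0.0] * dim for _ in range(dim)]


def residual_block(x: Vector, weight: List[Vector]) -> Vector:
    """Toy residual block computing x + W x."""
    return [xi + sum(w * xj for w, xj in zip(row, x)) for xi, row in zip(x, weight)]


if __name__ == "__main__":
    for mode in ("insert", "prefix", "suffix"):
        print(mode, expanded_layout(4, 2, mode))
    # A new block with zeroed weights passes its input through unchanged.
    print(residual_block([1.0, -2.0], zero_weight(2)))
```

Under these assumptions, a freshly added block initially acts as the identity, so the expanded model starts from the same function as the pre-trained one and only diverges as the new weights are trained.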

Abstract

Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.

Paper Structure

This paper contains 36 sections, 3 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: We continue the pre-training of a text-to-video diffusion model with ModelGrow, which includes model expansion and language understanding enhancement. Our proposed ModelGrow enhances visual quality, content richness, motion quality, and prompt-following ability. For each example, we present three keyframes along with a video. Play the video by clicking it with Adobe Acrobat.
  • Figure 2: Simplified forward pipeline and variants of block expansion methods. (a) The vanilla pipeline stacks transformer blocks in the traditional way for sequential processing; (b) insert stacking places new transformer blocks at intervals within the existing stack; (c) prefix stacking adds all the new transformer blocks at the beginning of the stack; (d) suffix stacking appends all the new transformer blocks at the end of the stack. Each variant is a different strategy for arranging transformer blocks to enhance model performance. We choose insert stacking as our expansion method.
  • Figure 3: Overview of the pipeline incorporating the LLM enhancement. We modify the architecture of the transformer block by adding another cross-attention block, which learns to condition on the LLM text embedding. The LLM cross-attention block is placed after the original T5 cross-attention block to enhance the language understanding ability of the generation model. For clarity, we omit the details of the temporal block and spatial block, which are the same as in the DiT (Peebles and Xie, 2023) transformer block. All parameters of the transformer block are updated during training.
  • Figure 4: Qualitative comparison of our results with baselines. Given the same prompt, our model Expansion-1.4B-LLM generates videos with higher quality and better semantic alignment than the baselines. Play the video by clicking it with Adobe Acrobat.
  • Figure 5: Examples of our model with different prompts. We use Expansion-1.4B-LLM to evaluate the effectiveness of modified recaptioning. The content of the video is richer with the help of the long prompt (L). However, some key information, such as "blue bath towel", is missing if we directly replace the original prompt (S) with the long prompt. By concatenating the original prompt with the long prompt (SL), the model retains the critical information while producing a richer video. Play the video by clicking it with Adobe Acrobat.
  • ...and 9 more figures
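
The two-condition block described in the Figure 3 caption can be sketched as two cross-attention steps applied in sequence, first over the T5 embedding and then over the LLM embedding. This is a toy illustration, not the authors' code: it uses a single attention head, identity projections, and residual additions, and it assumes both text embeddings already share the video tokens' dimension. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Single-head attention with identity projections (a simplification).

    Each query token attends over the context tokens, which serve as
    both keys and values.
    """
    dim = len(queries[0])
    out: Matrix = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def two_condition_block(video: Matrix, t5_tokens: Matrix, llm_tokens: Matrix) -> Matrix:
    """Apply the T5 cross-attention, then the added LLM cross-attention."""
    hidden = add(video, cross_attention(video, t5_tokens))
    hidden = add(hidden, cross_attention(hidden, llm_tokens))
    return hidden


if __name__ == "__main__":
    video = [[1.0, 0.0]]
    t5_tokens = [[0.5, 0.5]]
    llm_tokens = [[0.0, 1.0]]
    print(two_condition_block(video, t5_tokens, llm_tokens))
```

In a real model each attention step would have learned query, key, value, and output projections, and the LLM embedding would typically need projecting to the model width; those details are omitted here.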