Mimir: Improving Video Diffusion Models for Precise Text Understanding

Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang

TL;DR

This work tackles the challenge of limited text comprehension in text-to-video diffusion by integrating decoder-only large language models (LLMs) with conventional text encoders through a specialized Token Fuser. The framework harmonizes heterogeneous text representations via non-destructive fusion and a Semantic Stabilizer, enabling the diffusion model to leverage rich linguistic reasoning while preserving video priors. Empirical results on VBench show that Mimir improves semantic fidelity, spatial and temporal understanding, and overall video quality, with ablations underscoring the importance of normalization, Zero-Conv, and stabilization components. The approach demonstrates strong potential for precise, prompt-driven video synthesis and sets a path for more robust text-driven video generation using LLMs.

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation: precise text understanding resulting from superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: https://lucaria-academy.github.io/Mimir/

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 3 tables, 1 algorithm.

Figures (17)

  • Figure 1: Samples generated by Mimir. Our model demonstrates a powerful spatiotemporal imagination for input text prompts, e.g., (row-3) physically accurate petals, (row-4) the desert with illumination harmonization, which closely match human cognition.
  • Figure 2: The core idea of Mimir. A text encoder is well suited for fine-tuning pre-trained T2V models (✓) but struggles with limited text comprehension (✘). In contrast, a decoder-only LLM excels at precise text understanding (✓) but cannot be used directly in established video generation models because of the feature distribution gap and feature volatility (✘). Therefore, we propose the token fuser in Mimir to harmonize multiple tokens, achieving precise text understanding (✓) in T2V generation (✓).
  • Figure 3: The framework of Mimir. Given a text prompt, we employ a text encoder and a decoder-only large language model to obtain $e_\theta$ and $e_\beta$, respectively. Additionally, we add an instruction prompt which, after processing by the decoder-only model, yields the corresponding instruction token $e_i$. See token details in Sec. \ref{sec:tokens}. To prevent convergence issues in training caused by the feature distribution gap between $e_\theta$ and $e_\beta$, the proposed token fuser first applies a normalization layer and a learnable scale to $e_\beta$. It then uses Zero-Conv to preserve the original semantic space in the early stage of training. These modified tokens are then summed to produce $e \in \mathbb{R}^{n\times4096}$. Meanwhile, we initialize four learnable tokens $e_l$, which are added to $e_i$ to stabilize divergent semantic features. Finally, the token fuser concatenates $e$ and $e_s$ to generate videos.
  • Figure 4: Comparison between CogVideoX-5B and Mimir in T2V, where Mimir generates a vivid, stunning moment of a rocket launch.
  • Figure 5: Mimir demonstrates spatial comprehension and imagination, e.g., quantities, spatial relationships, colors, etc.
  • ...and 12 more figures
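
The token fuser described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. Several details are assumptions that the caption does not state: the class and function names are hypothetical, $e_\theta$ and $e_\beta$ are assumed to share the same shape, Zero-Conv is modelled as a zero-initialized linear projection, $e_s$ is assumed to be the sum $e_i + e_l$, and the concatenation is assumed to run along the token axis.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class TokenFuserSketch:
    """Hypothetical sketch of the token fuser; parameters are plain arrays."""

    def __init__(self, hidden, num_learnable=4, seed=0):
        # The caption uses hidden = 4096; a small value keeps the demo light.
        rng = np.random.default_rng(seed)
        # Learnable scale applied after normalization (assumed per-channel).
        self.scale = np.ones(hidden)
        # Zero-Conv modelled as a projection whose weight and bias start at
        # zero, so the LLM branch contributes nothing before training.
        self.zero_w = np.zeros((hidden, hidden))
        self.zero_b = np.zeros(hidden)
        # Four learnable tokens e_l (small random init is an assumption).
        self.e_l = rng.normal(scale=0.02, size=(num_learnable, hidden))

    def __call__(self, e_theta, e_beta, e_i):
        # 1) Normalize and rescale the LLM tokens to narrow the distribution gap.
        e_beta = layer_norm(e_beta) * self.scale
        # 2) Zero-initialized projection preserves the encoder's semantic space
        #    at the start of training.
        e_beta = e_beta @ self.zero_w + self.zero_b
        # 3) Sum the two token streams: e has shape (n, hidden).
        e = e_theta + e_beta
        # 4) Stabilize the instruction tokens (assumed: e_s = e_i + e_l).
        e_s = e_i + self.e_l
        # 5) Concatenate along the token axis (assumed).
        return np.concatenate([e, e_s], axis=0)


rng = np.random.default_rng(1)
fuser = TokenFuserSketch(hidden=8)
e_theta = rng.normal(size=(5, 8))         # text-encoder tokens
e_beta = rng.normal(size=(5, 8)) * 50.0   # LLM tokens on a very different scale
e_i = rng.normal(size=(4, 8))             # instruction tokens
out = fuser(e_theta, e_beta, e_i)
print(out.shape)  # (9, 8)
```

With the zero-initialized projection, the first five output rows equal `e_theta` exactly, which illustrates why the pre-trained T2V model's text conditioning is left undisturbed when training begins; the LLM branch only starts to contribute as the projection weights move away from zero.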