Text-Animator: Controllable Visual Text Video Generation

Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

TL;DR

Quantitative and qualitative experimental results demonstrate that the proposed Text-Animator approach generates visual text more accurately than state-of-the-art video generation methods.

Abstract

Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect of Text-to-Video (T2V) generation is the effective visualization of text within generated videos. Despite the progress achieved in T2V generation, current methods still cannot effectively visualize text in videos directly, as they mainly focus on summarizing semantic scene information and on understanding and depicting actions. While recent advances in image-level visual text generation show promise, transferring these techniques to the video domain is difficult, notably in preserving textual fidelity and motion coherence. In this paper, we propose an approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. In addition, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of the visualized text. Quantitative and qualitative experimental results demonstrate that our approach generates visual text more accurately than state-of-the-art video generation methods. The project page can be found at https://laulampaul.github.io/text-animator.html.

Paper Structure (17 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Given a sentence with visualized words, our Text-Animator is able to produce a wide range of videos that not only show the semantic information of the given text prompts, but also align with the visualized words. Our method is a one-stage method that requires no further tuning.
  • Figure 2: Framework of Text-Animator. Given a pre-trained 3D-UNet, the camera ControlNet takes the camera embedding as input and outputs camera representations; the text and position ControlNet takes the combined feature $z_{c}$ as input and outputs position representations. These features are then integrated into the 2D Conv layers and temporal attention layers of the 3D-UNet at their respective scales.
  • Figure 3: Qualitative comparison of Text-Animator and state-of-the-art T2V models or APIs in visual text generation. The prompt is "A red panda is holding a sign that says 'HELLO'".
  • Figure 4: Qualitative comparison of Text-Animator and combinations of state-of-the-art T2I visual text generation models (GlyphControl and AnyText) with I2V models (AnimateLCM [wang2024animatelcm], I2VGen-XL [zhang2023i2vgen], and SVD). The prompt is "A girl wearing a blue T-shirt with the words 'BEAUTY', slight smile, seaside background".
  • Figure 5: Qualitative comparison of Text-Animator and other methods on one example from the LAION-subset dataset. The prompt is "Two bags with the word 'CHRISTMAS' designed on it". Other methods cannot generate the correct word (please zoom in to see the results).
  • ...and 2 more figures
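
The data flow described in the Figure 2 caption can be illustrated with a toy sketch. It is not the authors' implementation: the class `ToyControlBranch`, the function `inject`, the scale names, and the scalar gains are all hypothetical, and plain Python lists stand in for feature maps. The routing used here (position representations added to the 2D Conv features, camera representations added to the temporal attention features) is one reading of the caption and is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]
ScaleFeatures = Dict[str, Vector]  # one feature vector per scale


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature sizes must match at a given scale")
    return [x + y for x, y in zip(a, b)]


@dataclass
class ToyControlBranch:
    """Hypothetical stand-in for a ControlNet: one scalar gain per scale."""
    gains: Dict[str, float]

    def __call__(self, condition: ScaleFeatures) -> ScaleFeatures:
        # Emit one residual per scale, sized like that scale's features.
        return {s: [g * v for v in condition[s]] for s, g in self.gains.items()}


def inject(backbone: Dict[str, ScaleFeatures],
           position_res: ScaleFeatures,
           camera_res: ScaleFeatures) -> Dict[str, ScaleFeatures]:
    """Add each branch's residual to its target layer type at every scale.

    Assumed routing: position -> 2D conv, camera -> temporal attention.
    """
    return {
        scale: {
            "conv2d": add(layers["conv2d"], position_res[scale]),
            "temporal_attn": add(layers["temporal_attn"], camera_res[scale]),
        }
        for scale, layers in backbone.items()
    }


# Toy backbone activations at two scales (made-up numbers).
backbone = {
    "high": {"conv2d": [1.0, 2.0, 3.0, 4.0], "temporal_attn": [0.5, 0.5, 0.5, 0.5]},
    "low": {"conv2d": [1.0, 1.0], "temporal_attn": [2.0, 2.0]},
}
# Stand-ins for the combined text/position feature z_c and the camera embedding.
z_c = {"high": [1.0, 0.0, 1.0, 0.0], "low": [1.0, 1.0]}
camera_embedding = {"high": [0.0, 2.0, 0.0, 2.0], "low": [4.0, 0.0]}

position_branch = ToyControlBranch(gains={"high": 0.5, "low": 2.0})
camera_branch = ToyControlBranch(gains={"high": 1.0, "low": 0.25})

fused = inject(backbone, position_branch(z_c), camera_branch(camera_embedding))
print(fused["high"]["conv2d"])        # [1.5, 2.0, 3.5, 4.0]
print(fused["low"]["temporal_attn"])  # [3.0, 2.0]
```

The point of the sketch is only the shape of the computation: each control branch produces one residual per scale, and the backbone features are left unchanged apart from those additive terms.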