Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu

TL;DR

This work presents "Dynamic Typography", an automated text animation scheme that deforms letters to convey semantic meaning and animates them with vibrant movements based on user prompts. It relies on vector graphics representations and an end-to-end optimization-based framework.

Abstract

Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which combines two challenging tasks. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework. This framework employs neural displacement fields to convert letters into base shapes and applies per-frame motion, encouraging coherence with the intended textual concept. Shape preservation techniques and perceptual loss regularization are employed to maintain legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our end-to-end methodology over baseline methods, which might comprise separate tasks. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability. Our code is available at: https://animate-your-word.github.io/demo/.

Paper Structure (24 sections, 10 equations, 17 figures, 2 tables)

Figures (17)

  • Figure 1: An overview of the framework. Given a letter represented as a set of control points, the Base Field deforms it into the shared base shape, setting the stage for per-frame displacement. The base shape is then duplicated across $k$ frames, and the Motion Field predicts the displacement of each control point at each frame, infusing movement into the base shape. Each frame is rendered by the differentiable rasterizer $R$, and the rendered frames are concatenated into the output video. The Base Field and Motion Field are jointly optimized using the video prior of a frozen, pre-trained video foundation model via Score Distillation Sampling ($\mathcal{L}_{\text{SDS}}$), with regularization on legibility ($\mathcal{L}_{\text{legibility}}$) and structure preservation ($\mathcal{L}_{\text{structure}}$).
  • Figure 2: Bézier curve representation of the letter "B". The endpoints are marked in orange, and the inner control points are in blue.
  • Figure 3: Illustration of the prior knowledge conflict issue. Left: the deformed "R" for "BULLFIGHTER" with the prompt "A bullfighter holds the corners of a red cape in both hands and waves it", generated by wordAI. Right: the result of animating that deformed letter with livesketch using the same prompt. The mismatch in prior knowledge between the separate models leads to significant appearance changes and severe artifacts, as highlighted by the red circles.
  • Figure 4: Base shape of "Y" for "GYM" with prompt "A man doing exercise by lifting two dumbbells in both hands."
  • Figure 5: Adjacent frames of the animation for the letter "E" in "JET". Large areas of alternating black and white "holes" occur within each frame, as highlighted by the red circles, causing severe flickering between adjacent frames. Panel (d) visualizes frame 1, highlighting the control points and the associated Bézier curves. It reveals frequent intersections among the Bézier curves, which lead to the flickering artifacts.
  • ...and 12 more figures
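
The data flow described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. In the paper the Base Field and Motion Field are learned neural displacement fields optimized with $\mathcal{L}_{\text{SDS}}$; here `base_field` and `motion_field` are hypothetical hand-written stand-ins (a fixed shear and a sinusoidal bob), and the differentiable rasterizer $R$ and all losses are omitted. The names `cubic_bezier` and `animate` are likewise introduced only for this example, as is the choice to feed base-shape points and the frame index to the motion stand-in.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier segment at t in [0, 1] (Bernstein form).

    p0 and p3 are the endpoints; p1 and p2 are the inner control points.
    """
    s = 1.0 - t
    weights = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    pts = (p0, p1, p2, p3)
    x = sum(w * p[0] for w, p in zip(weights, pts))
    y = sum(w * p[1] for w, p in zip(weights, pts))
    return (x, y)


def base_field(p: Point) -> Point:
    """Hypothetical stand-in for the learned Base Field: a fixed shear."""
    return (0.1 * p[1], 0.0)


def motion_field(p: Point, frame: int, k: int) -> Point:
    """Hypothetical stand-in for the learned Motion Field: a vertical bob."""
    phase = 2.0 * math.pi * frame / k
    return (0.0, 0.05 * math.sin(phase))


def animate(control_points: List[Point], k: int) -> List[List[Point]]:
    """Return k frames of control points: letter -> base shape -> per-frame motion."""
    # Step 1: displace the original letter once to obtain the shared base shape.
    base = []
    for p in control_points:
        dx, dy = base_field(p)
        base.append((p[0] + dx, p[1] + dy))

    # Step 2: copy the base shape to every frame and add a per-frame displacement.
    frames = []
    for f in range(k):
        frame_pts = []
        for p in base:
            dx, dy = motion_field(p, f, k)
            frame_pts.append((p[0] + dx, p[1] + dy))
        frames.append(frame_pts)
    return frames


# Usage: one cubic segment, animated over 4 frames.
segment = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
frames = animate(segment, k=4)
midpoints = [cubic_bezier(*frame, 0.5) for frame in frames]
```

Each entry of `frames` holds the displaced control points for one frame. In the actual framework those points would be passed to the differentiable rasterizer so that gradients from the video prior can flow back into both fields.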