Kinetic Typography Diffusion Model

Seonmi Park; Inhwan Bae; Seunghyun Shin; Hae-Gon Jeon

TL;DR

This work addresses realistic, user-driven kinetic typography with KineTy, a diffusion framework that generates animatable multi-letter text from prompts. The KineTy dataset contains about $600{,}000$ videos built from $584$ templates; each video is labeled with a static caption and a dynamic caption that guide appearance and motion, and ground-truth sequences are provided for fair evaluation. The model applies separate spatial and temporal guidance through the static and dynamic captions, adds a zero-convolution word-conditioning branch to preserve the text content, and uses a glyph loss to keep the rendered letters legible, giving the overall objective $L = L_{ldm} + \alpha L_{glyph}$. In the reported experiments, KineTy aligns better with the captions and produces more legible, aesthetically pleasing letter motions than the comparison models across several metrics (e.g., FVD, CLIP, OCR), and user studies support its practical usefulness for kinetic typography creation. The approach enables editable, high-quality kinetic typography from text prompts, which could improve workflow efficiency in motion graphics and related applications.
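The combined objective can be illustrated with a minimal Python sketch. The source gives only the form $L = L_{ldm} + \alpha L_{glyph}$; the mean-squared-error form of both terms, the function names `mse` and `total_loss`, and the default value of `alpha` are assumptions made for illustration, not the authors' implementation.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(noise_pred, noise_true, glyph_pred, glyph_true, alpha=0.1):
    """Weighted sum L = L_ldm + alpha * L_glyph.

    L_ldm is assumed here to be a noise-prediction MSE, as is common for
    latent diffusion models. L_glyph is assumed here to be an MSE between
    the predicted word and its ground truth. alpha is a hypothetical
    weight; the source does not state its value.
    """
    l_ldm = mse(noise_pred, noise_true)
    l_glyph = mse(glyph_pred, glyph_true)
    return l_ldm + alpha * l_glyph
```

Setting `alpha=0` recovers the plain diffusion objective, so the weight directly controls how strongly legibility is traded off against the denoising term.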

Abstract

This paper introduces a method for realistic kinetic typography that generates user-preferred animatable 'text content'. We draw on recent advances in guided video diffusion models to achieve visually pleasing text appearances. To do this, we first construct a kinetic typography dataset comprising about 600K videos. The dataset is built from a variety of combinations of 584 templates designed by professional motion graphics designers, which change each letter's position, glyph, and size (e.g., flying, glitches, chromatic aberration, and reflecting effects). Next, we propose a video diffusion model for kinetic typography, which must satisfy three requirements: aesthetic appearance, motion effects, and readable letters. To meet them, we present static and dynamic captions that serve as spatial and temporal guidance for the video diffusion model, respectively. The static caption describes the overall appearance of the video, such as colors, texture, and glyph, which represents the shape of each letter. The dynamic caption accounts for the movements of letters and backgrounds. We add one more form of guidance, based on zero convolution, to determine which text content should be visible in the video: we apply the zero convolution to the text content and impose it on the diffusion model. Lastly, we propose a glyph loss, which minimizes only the difference between the predicted word and its ground truth, to make the predicted letters readable. Experiments show that our model generates kinetic typography videos with legible and artistic letter motions based on text prompts.
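The zero-convolution guidance can be sketched in plain Python. The abstract states only that a zero convolution is applied to the text content and imposed on the diffusion model; the additive injection into backbone features, the 1x1 kernel, and the names `ZeroConv1x1` and `add_word_condition` are assumptions for illustration. The point of the sketch is that a zero-initialized projection leaves the backbone output unchanged at the start of training.

```python
class ZeroConv1x1:
    """A 1x1 convolution whose weights and bias are initialized to zero."""

    def __init__(self, in_channels, out_channels):
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, x):
        # x is a list of positions, each a list of in_channels floats.
        return [
            [sum(w * v for w, v in zip(row, pos)) + b
             for row, b in zip(self.weight, self.bias)]
            for pos in x
        ]


def add_word_condition(features, word_embedding, zero_conv):
    """Add the projected word condition to the features, element-wise.

    Assumes features and the projected condition share the same shape.
    """
    projected = zero_conv(word_embedding)
    return [
        [f + p for f, p in zip(f_pos, p_pos)]
        for f_pos, p_pos in zip(features, projected)
    ]


features = [[1.0, 2.0], [3.0, 4.0]]            # hypothetical backbone features
word = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]      # hypothetical word embedding
conv = ZeroConv1x1(in_channels=3, out_channels=2)
unchanged = add_word_condition(features, word, conv)
```

Because every weight starts at zero, `unchanged` equals `features`; the condition only begins to influence the output once training moves the weights away from zero.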

Paper Structure (19 sections, 8 equations, 7 figures, 4 tables)

Introduction
Related Work
Typography Generation
Typography Video Generation
Text-to-Video Diffusion Models
Kinetic Typography Dataset
Video Rendering
Video Captioning
Ground-truth Video Generation
Kinetic Typography Diffusion Model
Preliminary
Spatial and Temporal Guidance
Enhancing Glyph Legibility
Implementation Details
Experiments
...and 4 more sections

Figures (7)

Figure 1: An overview of our KineTy pipeline, which is motivated by the designer's workflow. Our key idea is to generate eye-catching and aesthetic animatable words based on user instructions.
Figure 2: Examples from our KineTy dataset. The dataset provides high-quality kinetic typography videos created by professional motion graphics designers, along with captions that describe the visual appearance and motion effects. To aid visualization, we show three frames from each video clip.
Figure 3: Statistics of our proposed dataset.
Figure 4: The architecture of our KineTy model.
Figure 5: Qualitative results from the comparison models and ours. Our results reflect the content of the captions more faithfully than those of the comparison models.
...and 2 more figures
