Video Text Preservation with Synthetic Text-Rich Videos

Ziyang Liu, Kevin Valencia, Justin Cui

TL;DR

This paper tackles the persistent difficulty of rendering legible text in text-to-video (T2V) diffusion outputs. It introduces a lightweight synthetic supervision pipeline that creates text-rich images with a text-to-image model, animates them with a text-agnostic image-to-video model, and fine-tunes Wan2.1 on the resulting data without architectural changes. The results show improved short-text legibility and temporal consistency, along with emerging structural priors for longer text, which suggests that typographic structure can be learned even when exact word-level accuracy is not reached. The synthetic-data approach offers a practical, low-cost path toward better textual fidelity in T2V generation.

Abstract

While text-to-video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render even short phrases or words correctly, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improving T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improved short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
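The data-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_synthetic_pairs`, `VideoPromptPair`, `toy_t2i`, and `toy_i2v` are hypothetical names, and the two toy functions are stand-ins for real T2I and I2V models. The one detail taken from the source is that the I2V stage is text-agnostic, so it receives only the image while the original prompt is kept as the caption of the resulting video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder types: a real pipeline would use tensors or image objects.
Image = List[List[int]]
Video = List[Image]


@dataclass
class VideoPromptPair:
    """One synthetic training example: the prompt and the animated frames."""
    prompt: str
    frames: Video


def build_synthetic_pairs(
    prompts: Sequence[str],
    t2i: Callable[[str], Image],
    i2v: Callable[[Image], Video],
) -> List[VideoPromptPair]:
    """Generate a text-rich image per prompt, then animate it.

    The I2V model is text-agnostic: it sees only the image, never the prompt.
    The prompt is reattached afterwards to form the video-prompt pair.
    """
    pairs = []
    for prompt in prompts:
        image = t2i(prompt)   # text-rich still image
        frames = i2v(image)   # short video animated from that image
        pairs.append(VideoPromptPair(prompt=prompt, frames=frames))
    return pairs


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_t2i(prompt: str) -> Image:
    return [[len(prompt)]]


def toy_i2v(image: Image, num_frames: int = 4) -> Video:
    return [image for _ in range(num_frames)]


pairs = build_synthetic_pairs(
    ["A man holding Hello World flag on the grass."], toy_t2i, toy_i2v
)
print(len(pairs), len(pairs[0].frames))
```

In an actual setup, the resulting pairs would serve as the fine-tuning set for the pre-trained T2V model; the fine-tuning procedure itself is not shown here because the source gives no details beyond the absence of architectural changes.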

Paper Structure

This paper contains 10 sections and 6 figures.

Figures (6)

  • Figure 1: The model still struggles with long or complicated tasks; however, it is able to capture the structure of the text.
  • Figure : A book about Distillation
  • Figure : A book about Distillation
  • Figure : A chill guy walking on the street
  • Figure : A man holding Hello World flag on the grass.
  • ...and 1 more figure