EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song

TL;DR

The paper tackles multilingual text rendering in images with diffusion transformers. It introduces EasyText, which encodes glyph-based condition features through a VAE and achieves precise placement with Implicit Character Position Alignment. A two-stage training regime (large-scale synthetic pretraining for glyph generation and spatial mapping, followed by fine-tuning on 20K high-quality images) makes multilingual rendering data-efficient. Empirical results show better text fidelity, layout control, and unseen-character generalization than baselines, along with strong text-image fusion. The approach supports layout-aware, long-text rendering across languages and is applicable to real-world scene text generation.

Abstract

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual characters encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
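
The abstract names position encoding interpolation but does not spell out its mechanics. As a rough illustration only, the sketch below assumes the idea is to give each condition (glyph) token a position id that falls inside a target box on the image latent grid, so that condition tokens and target-region tokens share positions. The function name `interpolate_position_ids`, the `(top, left, bottom, right)` box convention, and the use of plain linear interpolation are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def interpolate_position_ids(cond_h, cond_w, box):
    """Hypothetical helper: spread a cond_h x cond_w grid of condition
    tokens evenly over a target box on the image latent grid.

    box = (top, left, bottom, right), inclusive, in latent-grid units.
    Returns an array of shape (cond_h, cond_w, 2) holding (row, col)
    position ids; the ids may be fractional.
    """
    top, left, bottom, right = box
    # Linearly interpolate row and column coordinates across the box.
    rows = np.linspace(top, bottom, cond_h)
    cols = np.linspace(left, right, cond_w)
    grid_r, grid_c = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([grid_r, grid_c], axis=-1)

# A 2 x 4 condition grid mapped onto rows 10..12 and columns 20..26.
ids = interpolate_position_ids(2, 4, box=(10, 20, 12, 26))
print(ids.shape)
print(ids[0, :, 1])
```

In this sketch the condition grid and the target box may have different sizes; interpolation is what lets a fixed-size glyph condition be stretched or compressed to fit whichever region is requested.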

Paper Structure

This paper contains 24 sections, 5 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Text-rendered results generated by EasyText, which supports text rendering in over ten languages and produces high-quality results. It can render text either with explicit positional control or in a layout-free manner, and effectively handles curved and slanted regions. The displayed texts do not appear in the prompts; they are included solely to illustrate the intended rendering targets.
  • Figure 2: Overview of EasyText. We adopt a two-stage training strategy: large-scale pretraining for glyph generation and spatial mapping, followed by fine-tuning for visual-text integration and aesthetic refinement. Character positions from the condition input are aligned with target regions via implicit character position alignment, and training proceeds with image-conditioned LoRA.
  • Figure 3: Qualitative comparison of EasyText with other methods on the generation quality of both text and images. EasyText demonstrates outstanding performance.
  • Figure 4: Qualitative comparison of EasyText outputs in multiple languages for the same prompt.
  • Figure 5: Comparison of text rendering on the pretraining synthetic dataset using multiple font overlays versus single font overlays. The red boxes highlight the erroneous regions.
  • ...and 9 more figures