JoyType: A Robust Design for Multilingual Visual Text Creation

Chao Li; Chen Jiang; Xiaolong Liu; Jun Zhao; Guoxin Wang

TL;DR

JoyType addresses the challenge of reliably rendering multilingual text with preserved font styles in diffusion-based image generation. It introduces Font ControlNet and two font-specific hints (Canny and Font hints), coupled with a multi-layer OCR perceptual loss to improve small-text rendering, and trains on the 1M-scale JoyType-1M dataset. The approach outperforms state-of-the-art baselines on font-preservation and recognizability across languages, while maintaining flexibility as a plugin for other diffusion models. This work offers a practical, scalable solution for font-faithful visual text synthesis with broad applicability in design and cross-lingual image generation.

Abstract

Generating images that contain accurately rendered text, especially in non-Latin languages, remains a significant challenge for diffusion models. Existing approaches, such as feeding hint condition images to an auxiliary network (e.g., ControlNet), have made progress on this problem. However, diffusion models still fall short in tasks that require controlled text generation, such as rendering text in a specified font or at a small size. In this paper, we introduce JoyType, a novel approach to multilingual visual text creation that is designed to maintain the font style of text during image generation. We first assemble a training dataset, JoyType-1M, comprising 1 million data pairs. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then develop a text control network, Font ControlNet, which extracts font style information to steer image generation. To further strengthen the model's ability to maintain font style, notably when generating small-font text, we incorporate a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, show that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, enabling the creation of varied image styles in conjunction with other Stable Diffusion models on HuggingFace and CivitAI. Our project is open-sourced at https://jdh-algo.github.io/JoyType/.
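
The abstract describes the multi-layer OCR-aware loss only at a high level. As a rough illustration, and not the authors' implementation, the sketch below compares features of a predicted and a target text crop after each stage of a frozen encoder and sums the weighted distances. The toy stages `pairwise_diff` and `pairwise_mean`, the per-layer `weights`, and the balancing factor `lam` are all hypothetical; a real system would presumably use intermediate feature maps of a pretrained OCR model instead.

```python
from typing import Callable, List, Sequence

Feature = List[float]
Layer = Callable[[Sequence[float]], Feature]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_layer_perceptual_loss(
    predicted: Sequence[float],
    target: Sequence[float],
    layers: Sequence[Layer],
    weights: Sequence[float],
) -> float:
    """Weighted sum of feature distances measured after each encoder stage.

    Each stage is applied to the previous stage's output, so early terms
    compare low-level features and later terms compare coarser ones.
    """
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    total = 0.0
    p, t = list(predicted), list(target)
    for layer, weight in zip(layers, weights):
        p, t = layer(p), layer(t)
        total += weight * mse(p, t)
    return total


# Hypothetical stand-ins for the stages of a frozen OCR feature extractor.
def pairwise_diff(x: Sequence[float]) -> Feature:
    """Edge-like response: difference between neighbouring values."""
    return [b - a for a, b in zip(x, x[1:])]


def pairwise_mean(x: Sequence[float]) -> Feature:
    """Smoothing response: mean of neighbouring values."""
    return [(a + b) / 2 for a, b in zip(x, x[1:])]


def total_loss(denoise_loss: float, ocr_loss: float, lam: float = 0.1) -> float:
    """Combine a denoising loss with the OCR term (lam is an assumed weight)."""
    return denoise_loss + lam * ocr_loss
```

With identical inputs the OCR term is zero, so only the denoising loss remains. Summing over several stages, rather than comparing final outputs alone, is what lets fine stroke-level differences contribute to the gradient, which is the motivation the abstract gives for small-font text.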

Paper Structure (16 sections, 3 equations, 6 figures, 5 tables)

Introduction
Related Work
Proposed JoyType
Text-Control Training Pipeline
Multi-layer OCR Perceptual Loss
Inference Pipeline
Experiments
Data Collection
Implementation Details
Baselines and Evaluations
Font Style Preserving
Comparison of JoyType with SOTAs
Ablation Studies
Evaluation on Small Text Generation
Discussion and Limitations
...and 1 more section

Figures (6)

Figure 1: Compared to the commonly used glyph hint (b), JoyType introduces two new kinds of hint instructions: (c) Canny hint and (d) Font hint.
Figure 2: Illustration of JoyType's capacity to render high-fidelity multilingual text images.
Figure 3: The comprehensive framework of JoyType, illustrating the training pipeline, inference process, and data collection.
Figure 4: Various font styles used as hint condition images to evaluate JoyType's ability to preserve glyphs. All images use the same prompt, "a card". The standard style of each font (the hint image) is labeled at the top of each image.
Figure 5: More examples of JoyType in text generation.
...and 1 more figure
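
Figure 1 contrasts a filled glyph hint with an edge-based Canny hint. As a minimal sketch of the difference, the function below turns a binary glyph mask into an outline by keeping only foreground pixels that touch the background. This is plain boundary extraction, not the Canny algorithm (which adds smoothing, gradient estimation, non-maximum suppression, and hysteresis thresholding), and the name `outline_hint` and the list-of-lists mask format are assumptions made for illustration.

```python
from typing import List

Mask = List[List[int]]  # 1 = glyph pixel, 0 = background


def outline_hint(mask: Mask) -> Mask:
    """Keep glyph pixels that touch background or the image border (4-neighbourhood)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue  # background never becomes an edge
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            for ny, nx in neighbours:
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    out[y][x] = 1
                    break
    return out


# A filled 3x3 block stands in for a rendered glyph.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edges = outline_hint(glyph)
```

Here the interior pixel of the block is dropped and only its one-pixel border survives, so the hint carries the stroke contour rather than the filled shape.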
