Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Zeyu Liu; Weicong Liang; Yiming Zhao; Bohan Chen; Lin Liang; Lijuan Wang; Ji Li; Yuhui Yuan

TL;DR

The paper presents Glyph-ByT5-v2, a customized multilingual text encoder, and Glyph-SDXL-v2, a graphic generation model, which together render visual text accurately in 10 languages while improving aesthetic quality. It introduces large-scale multilingual datasets (more than 1M glyph-text pairs and 10M graphic design image-text pairs), the Multilingual VisualParagraphy benchmark (1,000 prompts, 100 per language), and step-aware preference learning (SPO) on Albedo SDXL to improve aesthetics. The approach combines translation-based data augmentation, glyph augmentation, and cross-language fusion with region-wise cross-attention, achieving strong multilingual spelling accuracy alongside competitive or superior aesthetics versus models like DALL·E3. Results are reported with objective OCR metrics and human studies; in the DALL·E3 comparison, users show a substantial preference for Glyph-SDXL-v2. This work provides a practical, scalable baseline for multilingual visual text rendering in modern text-to-image systems.

Abstract

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL·E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
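
The benchmark described above assesses spelling accuracy per language. As a rough illustration only, the Python sketch below averages a character-level OCR precision per language. The metric definition, the function names (char_precision, per_language_precision), and the sample records are all assumptions made for illustration; this section does not specify the paper's exact evaluation protocol or OCR engine.

```python
from collections import Counter, defaultdict


def char_precision(target: str, recognized: str) -> float:
    """Share of recognized characters that also occur in the target text.

    Hypothetical metric: a multiset overlap that ignores spaces and
    character order. It is not the paper's confirmed definition.
    """
    rec = Counter(recognized.replace(" ", ""))
    tgt = Counter(target.replace(" ", ""))
    total = sum(rec.values())
    if total == 0:
        return 0.0
    matched = sum((rec & tgt).values())  # multiset intersection
    return matched / total


def per_language_precision(records):
    """Average char_precision per language.

    `records` is an iterable of (language, target_text, ocr_text) triples,
    where ocr_text would come from running OCR on a generated image.
    """
    scores = defaultdict(list)
    for language, target, recognized in records:
        scores[language].append(char_precision(target, recognized))
    return {lang: sum(vals) / len(vals) for lang, vals in scores.items()}


# Made-up sample records, not data from the paper.
records = [
    ("English", "Happy New Year", "Happy New Year"),
    ("English", "Summer Sale", "Summer Salt"),
    ("Chinese", "新年快乐", "新年快东"),
]
result = per_language_precision(records)
print(result)
```

With a full benchmark run, each of the 10 languages would contribute 100 such records, so the returned dictionary would hold one averaged score per language.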

Paper Structure (12 sections, 7 figures, 8 tables)

Introduction
Related Work
Our Approach
Multilingual Glyph-ByT5
Multilingual Glyph-SDXL
Experiment
Training Settings
Evaluation Metrics
Multilingual VisualParagraphy Benchmark
Improving Aesthetics with SPO-SDXL
Comparison with DALL·E3
Conclusion

Figures (7)

Figure 1: Illustrating the multilingual visual text rendering results with our approach. We show the French, Spanish, Chinese, Japanese, and Korean visual text results in the 1st, 2nd, 3rd, 4th, and 5th rows, respectively.
Figure 2: Glyph-SDXL-v2 vs. Glyph-SDXL on graphic design images: win-rates for multilingual visual text spelling accuracy, layout quality, and visual aesthetics, as judged by human evaluators. The only difference between Glyph-SDXL-v2 and Glyph-SDXL is that we replace SDXL with Albedo SDXL + SPO.
Figure 3: Illustrating the multilingual visual text rendering results with our approach. We show the German, Portuguese, Italian, and Russian visual text results in the 1st, 2nd, 3rd, and 4th rows, respectively.
Figure 4: Visualization of multilingual generation results by DALL·E3 and Ideogram 1.0.
Figure 5: Illustrating the similarly-shaped character replacement strategy for Chinese glyph augmentation.
...and 2 more figures
