LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li

TL;DR

LeX-Art tackles the challenge of rendering multi-word text in generated images by adopting a data-centric pipeline that enriches prompts, curates high-quality data, and jointly optimizes lightweight and larger text-to-image models. It introduces LeX-10K, a high-quality 1024×1024 dataset produced through prompt enhancement, filtering, and knowledge-augmented recaptioning, followed by prompt enrichment with LeX-Enhancer and finetuning of LeX-FLUX and LeX-Lumina. A novel evaluation suite, LeX-Bench, and the PNED metric provide robust assessment of fidelity, aesthetics, and alignment, enabling comprehensive comparisons to glyph-based baselines. Empirical results show substantial improvements in text rendering accuracy and styling, with scalable gains as data size increases and via distillation, suggesting strong practical potential for high-quality visual text synthesis in design-oriented applications.

Abstract

We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on DeepSeek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our code, models, datasets, and demo are publicly available.
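
The abstract names PNED but does not define it here. As a rough illustration of what a pairwise normalized edit distance could look like, the sketch below pairs each target word with a detected (e.g. OCR) word so that the total normalized Levenshtein distance is minimal. The function names, the order-insensitive matching, and the penalty of 1 per unmatched word are all assumptions made for illustration, not the paper's definition.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the value lies in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def pairwise_ned(target_words, detected_words) -> float:
    """Minimum total NED over one-to-one word pairings (hypothetical helper).

    Assumption: the shorter list is padded with empty strings, so every
    unmatched word contributes a distance of exactly 1.
    """
    n = max(len(target_words), len(detected_words))
    if n == 0:
        return 0.0
    targets = list(target_words) + [""] * (n - len(target_words))
    detected = list(detected_words) + [""] * (n - len(detected_words))
    # Brute force over pairings; only practical for a handful of words.
    return min(
        sum(normalized_edit_distance(t, d) for t, d in zip(targets, perm))
        for perm in permutations(detected)
    )


if __name__ == "__main__":
    print(pairwise_ned(["GOOD", "MUSIC"], ["MUSIC", "GOOD"]))  # word order ignored
    print(pairwise_ned(["GOOD", "MUSIC"], ["GOOD"]))           # one word missing
```

The brute-force search over permutations keeps the sketch dependency-free; for longer word lists an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the more practical choice.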

Paper Structure

This paper contains 37 sections, 13 figures, 3 tables, 2 algorithms.

Figures (13)

  • Figure 1: Given prompts for visual text generation, our proposed LeX-FLUX and LeX-Lumina can generate text images with multiple words, aesthetically pleasing complex layouts, and good controllability of text attributes.
  • Figure 2: Illustration of the LeX-Art framework.
  • Figure 3: The framework of the data construction pipeline. The red words in the R1-enhanced prompt are not rendered in the generated image; this mismatch is fixed by the knowledge-augmented recaptioning with GPT-4o.
  • Figure 4: Images generated by FLUX.1 [dev] (flux2024) from different prompts. The original captions, from the top row to the bottom row: (1) A poster with the words Good Music remixed and unreleased on it, with text on it: "UNRELEASED", "REMIXED", "GOOD.MUSIC", "KANYEWEST", "SPERIOD". (2) A movie poster, with text on it: "AFACE", "WITHOUT", "EYES", "DOL", "JUL". (3) A menu of a fast food restaurant that contains "Sandwich Combo", "Grilled Chicken", "Lettuce", "Tomato", "Mayo", "Fries&Drink", and "Pepsi".
  • Figure 5: Distributions of image quality scores and image aesthetics scores for the AnyText dataset (tuo2023anytext) and LeX-10K. We randomly sampled 10K entries from AnyWord-3M, computed quality and aesthetic scores with Q-Align (wu2023qalign) for these entries and for the images in LeX-10K, and visualized the distributions of both scores. LeX-10K generally has higher quality and aesthetic scores.
  • ...and 8 more figures