TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

Yu Xie, Jielei Zhang, Pengyu Chen, Ziyue Wang, Weihang Wang, Longwen Gao, Peiyi Li, Huyang Sun, Qiang Zhang, Qian Qiao, Jiaqing Fan, Zhouhui Lian

TL;DR

TextFlux addresses the trade-off between glyph accuracy and scene integration in multilingual scene text synthesis by removing OCR-based conditioning and instead embedding spatial glyph guidance into a DiT-based diffusion inpainting backbone. By concatenating a glyph template with the scene image and training with a flow-matching objective, it renders high-fidelity, multi-line, multilingual text with strong zero-shot generalization. The approach achieves state-of-the-art results on reconstruction and editing across multiple languages while reducing data requirements and simplifying training, though at substantial computational cost and with limitations on cursive scripts. The work makes multilingual scene text synthesis more practically accessible and lays a foundation for further exploration of context-driven, OCR-free text rendering in complex visuals.
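
The two ingredients named above, glyph-template concatenation and a flow-matching objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of nested lists as images, the vertical concatenation axis, and the time convention (t = 0 is data, t = 1 is noise) are all assumptions made for illustration.

```python
def concat_glyph_and_scene(glyph, scene, mask):
    """Stack a rendered glyph template above the scene image.

    All inputs are 2D lists (rows of pixel values). The vertical
    concatenation axis is an assumption for this sketch.
    """
    width = len(scene[0])
    if any(len(row) != width for row in glyph + scene + mask):
        raise ValueError("glyph, scene, and mask must share the same width")
    if len(mask) != len(scene):
        raise ValueError("mask must have the same height as scene")
    # The glyph half is visible context, so its mask is all zeros;
    # only the scene half carries the region to be inpainted (mask == 1).
    glyph_mask = [[0] * width for _ in glyph]
    image = [row[:] for row in glyph] + [row[:] for row in scene]
    full_mask = glyph_mask + [row[:] for row in mask]
    return image, full_mask


def flow_matching_pair(x0, noise, t):
    """Return (x_t, target velocity) on a straight path from data to noise.

    Assumed convention: x_t = (1 - t) * x0 + t * noise, so the velocity
    a model would regress toward is (noise - x0).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, velocity


# Toy usage: a 2x2 glyph template, a 2x2 scene, and a mask over the right column.
image, full_mask = concat_glyph_and_scene(
    [[1, 1], [0, 0]], [[5, 6], [7, 8]], [[0, 1], [0, 1]]
)
x_t, velocity = flow_matching_pair([0.0, 2.0], [1.0, 0.0], 0.5)
```

In this sketch the model would see the concatenated image and mask as a single inpainting input, so the glyph template acts as in-context guidance rather than as a separate encoder branch. A real system would operate on latent tensors rather than pixel lists.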

Abstract

Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

Paper Structure

This paper contains 21 sections, 3 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Some examples of high-fidelity multilingual scene text images generated by our TextFlux.
  • Figure 2: TextFlux addresses the common conflict between glyph accuracy and stylistic integration in scene text synthesis. Prior works often exhibit either glyph errors (first column) or poor visual fidelity and integration (second column). In contrast, TextFlux accurately renders complex and multi-line text with high fidelity to the scene context (third and fourth columns).
  • Figure 3: Traditional methods employ OCR encoders to extract and inject various visual text features (e.g., font, glyph, color) as conditions. TextFlux streamlines the process by directly providing spatial glyph cues.
  • Figure 4: Overview of TextFlux. We propose an OCR-free scene text synthesis method that spatially concatenates glyph-rendered text with the original image as model input, enabling the diffusion transformer to leverage its inherent context-awareness to render text in the masked regions.
  • Figure 5: Comparison of scene text synthesis methods: AnyText, AnyText2, and our TextFlux. More results are available in the appendix.
  • ...and 11 more figures