What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Xingsong Ye, Yongkun Du, JiaXin Zhang, Chen Li, Jing LYU, Zhineng Chen

TL;DR

This work tackles the persistent domain gap between real images and synthetic data for scene text recognition by systematically analyzing existing rendering-based synthetic datasets and their failure modes. It introduces UnionST, a strong synthetic data engine with diversified corpora, fonts, and layouts, plus a self-evolution learning framework (SEL) that pseudo-labels unlabeled real data to create UnionST-P and UnionST-SP datasets. Empirical results show UnionST-S/UnionST-SP outperform traditional synthetic datasets and even approach or exceed real-data performance on several benchmarks, while SEL dramatically reduces the need for manual annotation (down to around 9% of labels in some setups). The approach demonstrates substantial practical value for data-scarce settings and offers a scalable path toward near-real STR performance with large-scale synthetic data and selective labeling, with future work extending to font filtering and multilingual STR.

Abstract

Large-scale, category-balanced text data is essential for training effective Scene Text Recognition (STR) models, but such data is hard to obtain from real images. Synthetic data offers a cost-effective and perfectly labeled alternative. However, models trained on it often lag behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitation: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations of challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over those trained on existing synthetic datasets, and even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance while seeing only 9% of the real data labels.

Paper Structure (20 sections, 7 equations, 11 figures, 9 tables, 2 algorithms)

Figures (11)

  • Figure 1: Comparison of two scene text data synthesis paradigms. "T2I" stands for Text-to-Image and red highlights indicate that the edited or rendered images do not meet the specified condition.
  • Figure 2: Left: Quantitative comparisons across multiple scenarios, including common and seven challenging cases, are performed based on the normalized accuracy rate (%). Right: Illustrative examples of traditional synthetic engines (MJ [mj], ST [st], CurvedST [curvest], SynthAdd [li2019show], and SynthTIGER [yim2021synthtiger]) and our UnionST, which provides more diverse (various text layouts and content), realistic, and challenging samples.
  • Figure 3: Real text samples (the top line in each subset) and data generated by UnionST (the bottom line) grouped by subsets according to their challenges. Beyond the subsets identified in Union14M [jiang2023revisiting], we introduce three additional ones: Multi-Sized (words with varying sizes, including subscripts and superscripts), Perspective (variations in viewpoint), and Degraded (blur or low resolution caused by camera shake or small text size).
  • Figure 4: Pipelines of the UnionST data engine (top) and our SEL framework (bottom). Top: We randomly sample a text (e.g., UnionST) from the corpus, select a font (e.g., arial.ttf) that supports all of its characters, and render the text using this font in a color chosen from a predefined colormap. The rendered text is then placed onto a background with effects using our placement algorithm. For more visualizations of UnionST data, see the supplementary (ref. fig:data_view). Bottom: The lines represent the number of passes. Left: UnionST-S and UnionST-P come from different corpora, and UnionST-P is combined with UnionST-S for STR model retraining. Right: Pseudo-labels undergo two rounds of self-iteration, followed by one round with manually annotated data. Each iteration fine-tunes the previous model.
  • Figure 5: Word clouds (left) and word length distributions (right) for the synthetic corpus (top) and the real corpus (bottom).
  • ...and 6 more figures
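
The sampling steps in the top half of the Figure 4 caption (pick a text, pick a font that supports all of its characters, pick a color from a colormap) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `pick_font` and `sample_render_spec`, the precomputed `font_charsets` table, the toy colormap, and the choice to skip texts that no font supports are all assumptions made for the example. Rendering and the placement algorithm are left out because the caption gives no details about them.

```python
import random
import string


def pick_font(text, font_charsets, rng):
    """Return a random font whose character set covers every character of `text`.

    `font_charsets` maps a font name to the set of characters it can draw
    (a hypothetical, precomputed table; not taken from the paper).
    Returns None when no font supports the whole string.
    """
    needed = set(text)
    # Sort so the result depends only on the RNG state, not on dict ordering.
    candidates = sorted(
        name for name, chars in font_charsets.items() if needed <= chars
    )
    return rng.choice(candidates) if candidates else None


def sample_render_spec(corpus, font_charsets, colormap, rng):
    """Sample a text, a compatible font, and a color, in the caption's order."""
    text = rng.choice(corpus)
    font = pick_font(text, font_charsets, rng)
    if font is None:
        return None  # assumption: texts that no font can render are skipped
    color = rng.choice(colormap)
    return {"text": text, "font": font, "color": color}


# Toy inputs, for illustration only.
FONT_CHARSETS = {
    "arial.ttf": set(string.ascii_letters + string.digits),
    "digits_only.ttf": set(string.digits),
}
COLORMAP = [(255, 255, 255), (0, 0, 0), (200, 30, 30)]

spec = sample_render_spec(["UnionST"], FONT_CHARSETS, COLORMAP, random.Random(0))
```

With these toy inputs, only `arial.ttf` covers the letters in "UnionST", so it is the font selected. In a real engine the character sets would presumably be read from the font files themselves rather than written by hand.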