
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang

TL;DR

FreeReal addresses the synth-to-real and language-to-language gaps in pre-training scene text detectors by combining labeled synthetic data (LSD) with unlabeled real data (URD) in a real-domain-aligned framework. Central to this approach are GlyphMix, which grafts synthetic glyphs into real images, and Character Region Awareness (CRA), which emphasizes character-level learning to bridge multilingual data. The method operates in a student–teacher setup with adaptive loss masking and pseudo-labels to leverage URD without domain drift. Empirical results on four benchmarks show consistent gains across multiple detectors, confirming the effectiveness of jointly exploiting LSD and URD with relatively small amounts of synthetic data.

Abstract

Existing scene text detection methods typically rely on extensive real data for training. Because annotated real images are scarce, recent works have exploited large-scale labeled synthetic data (LSD) to pre-train text detectors. However, the resulting synth-to-real domain gap limits detector performance. In this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that exploits the complementary strengths of LSD and unlabeled real data (URD). Specifically, to bridge the real and synthetic worlds for pre-training, we tailor a glyph-based mixing mechanism (GlyphMix) to text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real-domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap between English-dominated LSD and URD in various languages. Without bells and whistles, FreeReal improves FCENet, PSENet, PANet, and DBNet by an average of 1.97%, 3.90%, 3.85%, and 4.56%, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.
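
The glyph-embedding step described in the abstract can be pictured as mask-based compositing: pixels on character strokes come from the synthetic image, and all other pixels come from the real image. The following is a minimal sketch under that assumption, not the authors' implementation. The function name `glyph_mix`, the boolean `glyph_mask` input, and the toy arrays are all hypothetical, and the sketch does not cover how the glyph mask is obtained from the synthetic labels.

```python
import numpy as np


def glyph_mix(synth_img, glyph_mask, real_img):
    """Paste synthetic glyph pixels onto a real image (illustrative only).

    synth_img, real_img: (H, W, 3) uint8 arrays of the same shape.
    glyph_mask: (H, W) boolean array, True on character strokes.
    The mask is assumed to be given; deriving it is out of scope here.
    """
    if synth_img.shape != real_img.shape:
        raise ValueError("synthetic and real images must share a shape")
    if glyph_mask.shape != synth_img.shape[:2]:
        raise ValueError("mask must match the image height and width")
    # Broadcast the (H, W) mask over the channel axis.
    mask = glyph_mask.astype(bool)[..., None]
    # Stroke pixels come from the synthetic image, the rest stay real.
    return np.where(mask, synth_img, real_img)


# Toy data: a uniform "real" image and a uniform "synthetic" image.
real = np.full((4, 6, 3), 200, dtype=np.uint8)
synth = np.full((4, 6, 3), 30, dtype=np.uint8)

# Hypothetical glyph mask covering a 2x2 block of stroke pixels.
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:4] = True

mixed = glyph_mix(synth, mask, real)
changed = int((mixed != real).any(axis=-1).sum())
```

In this toy case only the four masked pixels differ from the real image, which matches the intuition that the background stays in the real domain while the pasted strokes carry the synthetic annotation.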

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: We present different pre-training paradigms for enhancing text detectors.
  • Figure 2: We present the domain classification task and the performance of different ways of bridging the domain between LSD and URD. For text detection, experiments show that our domain bridging method aligns significantly better with the real domain. The explanation is straightforward: (I) Mixup may introduce unexpected pixel-wise ambiguities; (II) CutMix may drop many text regions and leave the semantics of the remaining text incomplete; (III) ClassMix introduces salient boundary priors; (IV) GlyphMix (ours) preserves the semantic information of text without causing real-domain drift or introducing extra boundary priors.
  • Figure 3: Our network pipeline in a student-teacher framework.
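
To make the failure modes listed in the Figure 2 caption concrete, the sketch below applies three generic mixing operations to a toy image pair: a pixel-wise blend (Mixup), a rectangular paste (CutMix), and a mask-based paste. These are the standard textbook forms of the operations, written for illustration and not taken from the paper's code. The helper names, the toy arrays, and the chosen box are all assumptions. On one reading of the caption, whether the mask covers a whole text region or only the glyph strokes is what separates a ClassMix-style paste from a GlyphMix-style one.

```python
import numpy as np


def mixup(a, b, lam):
    """Pixel-wise convex blend of two uint8 images."""
    blend = lam * a.astype(np.float32) + (1.0 - lam) * b.astype(np.float32)
    return blend.round().astype(np.uint8)


def cutmix(a, b, box):
    """Paste the rectangle box=(top, left, bottom, right) of a onto b."""
    top, left, bottom, right = box
    out = b.copy()
    out[top:bottom, left:right] = a[top:bottom, left:right]
    return out


def mask_paste(a, b, mask):
    """Paste only the pixels of a selected by a boolean (H, W) mask."""
    return np.where(mask[..., None], a, b)


# Toy synthetic image: black background with a bright "word" of strokes.
synth = np.zeros((4, 6, 3), dtype=np.uint8)
stroke_mask = np.zeros((4, 6), dtype=bool)
stroke_mask[1:3, 1:5] = True
synth[stroke_mask] = 250

# Toy real image: uniform gray.
real = np.full((4, 6, 3), 100, dtype=np.uint8)

blended = mixup(synth, real, lam=0.5)        # every pixel becomes a blend
pasted_box = cutmix(synth, real, (0, 0, 4, 3))  # box cuts through the word
pasted_strokes = mask_paste(synth, real, stroke_mask)
```

In this toy setup the blend changes every pixel, including the real background; the rectangular paste keeps only half of the word and also drags in synthetic background; the stroke-level paste keeps the whole word and leaves every non-stroke pixel real.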