Syn3DTxt: Embedding 3D Cues for Scene Text Generation

Li-Syun Hsiung, Jun-Kai Tu, Kuan-Wu Chu, Yu-Hsuan Chiu, Yan-Tsung Peng, Sheng-Luen Chung, Gee-Sern Jison Hsu

TL;DR

This work addresses the lack of 3D context in synthetic scene text data by introducing Syn3DTxt, a data-generation framework that embeds 3D cues via surface-normal RGB masks to better capture geometry, perspective, and curvature. The methodology renders detailed 3D text meshes with controllable background, content, curvature, orientation, and font, enabling disentangled learning of geometric transformations from appearance. A comprehensive dataset composition (Syn3DTxt and variants) and a staged training strategy demonstrate that 3D-augmented data improves perspective-consistent text editing, with quantitative gains in SSIM, FID, and editing accuracy across multiple benchmarks. The work also provides a public toolkit and datasets to advance 3D-aware scene text synthesis and editing in real-world conditions, highlighting the practical impact of explicit geometric supervision for robust text rendering.

Abstract

This study investigates the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions among spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the geometric cues needed to embed text accurately into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristics. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of previous datasets. (a) MARIO-10M, constructed by TextDiffuser, which captures real-world text instances predominantly within 2D imagery but lacks comprehensive 3D geometric annotations. (b) Synthetic dataset generated using the SRNet pipeline, which primarily applies simplified 2D warping transformations without incorporating 3D spatial details. These examples illustrate that existing datasets mainly consist of 2D images and rarely include accurate representations of text within realistic 3D environments, limiting their utility in training robust models capable of handling complex spatial interactions in scene text synthesis tasks.
  • Figure 2: Visualization of RGB-encoded normal vectors within a spherical coordinate system. Each point on the sphere represents a distinct orientation, with its normal vector coordinates mapped directly to RGB colors. By connecting these spherical points to corresponding text images generated at specific rotation angles, we illustrate how text rendering outcomes vary according to precise 3D orientations. All angles follow the defined order ($\theta$, $\phi$, $\gamma$).
  • Figure 3: Example of generated text data with three-dimensional bending effects. The first column shows the rendered text images; the second column displays the corresponding normal vector masks encoded in RGB, highlighting detailed 3D spatial characteristics; and the third column presents binary masks indicating text regions. Unlike simple planar rotations, our approach assigns distinct normal vectors to each character, enabling more accurate modeling of the complex geometric transformations commonly observed in real-world scenes.
  • Figure 4: Qualitative comparison between 2D and 3D models.
  • Figure 5: Qualitative comparison between the original MOSTEL and our enhanced MOSTEL 3D on ScenePair across three representative cases: (i) left block, failure cases of the original MOSTEL; (ii) middle block, cases where the original MOSTEL successfully edits the text but fails to restore the background; (iii) right block, successful cases of both MOSTELs.
  • ...and 1 more figures
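
The RGB encoding of normal vectors described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the widely used normal-map convention in which each unit-normal component in [-1, 1] is mapped linearly to a color channel in [0, 255]; the exact encoding used by Syn3DTxt is not stated in this summary. The axis assignment for the angles ($\theta$, $\phi$, $\gamma$) and the helper names `rotation_matrix`, `normal_to_rgb`, and `rgb_to_normal` are likewise hypothetical, chosen only for illustration.

```python
import numpy as np

def rotation_matrix(theta, phi, gamma):
    """Rotation from angles (radians) about the x, y, z axes, applied in that order.

    Assumption: the mapping of (theta, phi, gamma) to axes is illustrative only.
    """
    cx, sx = np.cos(theta), np.sin(theta)
    cy, sy = np.cos(phi), np.sin(phi)
    cz, sz = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def normal_to_rgb(normal):
    """Encode a normal vector as 8-bit RGB: channel = (component + 1) / 2 * 255."""
    n = np.asarray(normal, dtype=np.float64)
    n = n / np.linalg.norm(n)  # make sure the vector has unit length
    return np.round((n + 1.0) * 0.5 * 255.0).astype(np.uint8)

def rgb_to_normal(rgb):
    """Decode an 8-bit RGB triple back to an (approximately) unit normal."""
    n = np.asarray(rgb, dtype=np.float64) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n)

# A flat text plane facing the camera has normal (0, 0, 1) in this sketch.
front = np.array([0.0, 0.0, 1.0])

# Rotating the plane changes its normal, and therefore its mask color.
tilted = rotation_matrix(0.0, np.pi / 6, 0.0) @ front

front_rgb = normal_to_rgb(front)
tilted_rgb = normal_to_rgb(tilted)
print("front :", front_rgb)
print("tilted:", tilted_rgb)
```

Under this convention, a planar text region receives a single color, while curved text (as in Figure 3) would receive a different color per character or per pixel, since each surface point has its own normal. Decoding with `rgb_to_normal` recovers the orientation up to 8-bit quantization error.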