Syn3DTxt: Embedding 3D Cues for Scene Text Generation

Li-Syun Hsiung; Jun-Kai Tu; Kuan-Wu Chu; Yu-Hsuan Chiu; Yan-Tsung Peng; Sheng-Luen Chung; Gee-Sern Jison Hsu

TL;DR

This work addresses the lack of 3D context in synthetic scene text data by introducing Syn3DTxt, a data-generation framework that embeds 3D cues as surface-normal RGB masks to capture geometry, perspective, and curvature. The pipeline renders detailed 3D text meshes with controllable background, content, curvature, orientation, and font, so that geometric transformations can be learned separately from appearance. Using the Syn3DTxt dataset and its variants together with a staged training strategy, the authors show that 3D-augmented data improves perspective-consistent text editing, with improvements in SSIM, FID, and editing accuracy across multiple benchmarks. The work also provides a public toolkit and datasets for 3D-aware scene text synthesis and editing under real-world conditions, highlighting the practical value of explicit geometric supervision for robust text rendering.

Abstract

This study investigates the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches still rely on 2D data, sourcing authentic training examples from movie posters and book covers. This limits their ability to capture the complex interactions between spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the geometric cues needed to embed text accurately into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristics. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advances in text rendering under complex 3D spatial conditions.
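
To make the idea of a surface-normal RGB mask concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common convention of mapping each unit-normal component from [-1, 1] to an 8-bit channel in [0, 255]; the summary above does not state the exact encoding Syn3DTxt uses. The function names `normal_to_rgb`, `planar_normal`, and `render_normal_mask` are hypothetical, and the text surface is assumed to be a flat plane rotated by yaw and pitch.

```python
import math

def normal_to_rgb(n):
    """Encode a 3D normal as an 8-bit RGB triple (assumed convention:
    each unit-vector component in [-1, 1] maps linearly to [0, 255])."""
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("zero-length normal")
    # Python's round() rounds halves to the nearest even integer, so 127.5 -> 128.
    return tuple(round((c / length + 1.0) * 0.5 * 255) for c in n)

def planar_normal(yaw_deg, pitch_deg):
    """Normal of a plane that initially faces the camera (+z), rotated by
    yaw about the y-axis and then by pitch about the x-axis."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (
        math.sin(yaw),
        -math.sin(pitch) * math.cos(yaw),
        math.cos(pitch) * math.cos(yaw),
    )

def render_normal_mask(text_mask, normal):
    """Paint the encoded normal on text pixels and black elsewhere.
    text_mask is a list of rows of booleans (True = text pixel)."""
    color = normal_to_rgb(normal)
    return [[color if px else (0, 0, 0) for px in row] for row in text_mask]

# Hypothetical 2x3 text mask on a plane turned 30 degrees about the vertical axis.
mask = [[False, True, True],
        [False, True, False]]
image = render_normal_mask(mask, planar_normal(30, 0))
```

Under this convention a plane facing the camera encodes to a bluish color, and turning the plane shifts the red and green channels, so a model can read the orientation of the text surface directly from the mask color.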
