Text Image Generation for Low-Resource Languages with Dual Translation Learning

Chihiro Noguchi, Shun Fukuda, Shoichiro Mihara, Masao Yamanaka

TL;DR

This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages, using a diffusion model conditioned on binary states: ``synthetic'' and ``real.''

Abstract

Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: ``synthetic'' and ``real.'' The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model's explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.
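
The dual translation setup described in the abstract can be illustrated with a minimal sketch of how one training example per task might be assembled. This is not the authors' implementation: the integer encoding `SYNTH`/`REAL`, the helper `make_example`, the array shapes, and the use of the standard DDPM forward-noising formula are all assumptions made for illustration.

```python
import numpy as np

SYNTH, REAL = 0, 1  # hypothetical integer encoding of the binary state


def make_example(plain, styled, state, alpha_bar, rng):
    """Build one denoising training example for a plain -> styled translation.

    plain, styled: float arrays of shape (H, W, C) showing the same text.
    state: SYNTH or REAL, naming the domain that `styled` belongs to.
    alpha_bar: cumulative noise-schedule product in (0, 1) for one timestep.
    """
    if plain.shape != styled.shape:
        raise ValueError("plain and styled images must share a shape")
    if state not in (SYNTH, REAL):
        raise ValueError("state must be SYNTH or REAL")
    noise = rng.standard_normal(styled.shape)
    # Assumed standard DDPM forward process: blend the clean target with noise.
    noisy = np.sqrt(alpha_bar) * styled + np.sqrt(1.0 - alpha_bar) * noise
    return {"plain": plain, "noisy": noisy, "noise": noise, "state": state}


rng = np.random.default_rng(0)
plain = np.ones((8, 32, 3))  # placeholder for a rendered plain text image

# Task 1: plain -> synthetic; Task 2: plain -> real (placeholder targets).
synth_example = make_example(plain, np.zeros((8, 32, 3)), SYNTH, 0.5, rng)
real_example = make_example(plain, np.full((8, 32, 3), 0.25), REAL, 0.5, rng)
```

In this sketch the two tasks differ only in the target image and the state label, which is what would let a single model be switched to the real condition at inference time.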

Paper Structure

This paper contains 18 sections, 3 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: While the source language has both synthetic and real text images available, the target language only possesses synthetic text images. This study aims to generate text images in the target language that emulate the style of real text images.
  • Figure 2: Overview of the proposed framework for text image generation. During training, the diffusion model (DM) learns to generate synthetic and real text images, governed by a binary state input: either synth or real. Plain text images are used as input to provide the corresponding textual content. At inference, plain text images in the target language are fed into the model under the real condition. FDB Guidance enables the model to generate text images with greater accuracy and variety. Moreover, FE Guidance can further improve the text content fidelity of the generated text images.
  • Figure 3: Examples of text images generated using a constant guidance scale $w\in \{1.25, 2, 2.75\}$, and using FDB Guidance.
  • Figure 4: Diffusion model architecture in the proposed framework. The model is conditioned on plain text images by concatenating them with the noisy input images. The binary state is additionally injected through both the timestep embeddings and the cross-attention layers.
  • Figure 5: Qualitative comparison of generated text images. The top row displays plain text images, while the subsequent rows show text images generated by different methods, all sharing the same textual content as their corresponding plain text images.
  • ...and 6 more figures
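
Two ideas mentioned in the captions for Figures 3 and 4 can be sketched briefly: conditioning by concatenating the plain text image with the noisy input, and guidance with a constant scale $w$. The code below is an illustrative sketch only. It assumes channel-wise concatenation and the standard classifier-free guidance combination; the helper names `condition_input` and `guided_noise` are hypothetical, the source does not state which condition is dropped for the unconditional branch, and FDB Guidance itself is not reproduced here.

```python
import numpy as np


def condition_input(noisy, plain):
    """Concatenate the noisy image and the plain text image along channels."""
    return np.concatenate([noisy, plain], axis=-1)


def guided_noise(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance with a constant scale w.

    w = 1 returns the conditional prediction unchanged; larger w pushes the
    prediction further along the conditional direction.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)


noisy = np.zeros((8, 32, 3))
plain = np.ones((8, 32, 3))
model_input = condition_input(noisy, plain)  # 3 + 3 = 6 channels

# Placeholder noise predictions standing in for two forward passes of a model.
eps_uncond = np.zeros((8, 32, 3))
eps_cond = np.ones((8, 32, 3))
guided = {w: guided_noise(eps_uncond, eps_cond, w) for w in (1.25, 2.0, 2.75)}
```

The three scales match the constant values listed for Figure 3. Under the assumed formula, a larger $w$ generally favors fidelity to the condition over sample diversity.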