Type-R: Automatically Retouching Typos for Text-to-Image Generation

Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, Kota Yamaguchi

TL;DR

This work tackles the persistent challenge of rendering accurate and legible text in text-to-image generation. It introduces Type-R, a post-processing pipeline with four automatic stages—error detection, text erasing, layout regeneration, and typo correction—that leverages external OCR, vision-language models, and text-editing tools without retraining the base generator. Through experiments on MARIO-Eval, Type-R consistently improves OCR-based text accuracy while maintaining or enhancing graphic design quality, and it outperforms typography-centric baselines in the quality-accuracy trade-off. The method is plug-and-play across backbones like Stable Diffusion and Flux, offering practical impact for applications requiring reliable in-image text rendering, such as posters, ads, and signage.

Abstract

While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
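The error-identification step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the words requested in the prompt and the words read back by an OCR model are already available as plain string lists, and it swaps in Python's standard `difflib` similarity for whatever matching the paper actually uses. The function name `plan_retouch`, the `threshold` parameter, and the output keys are hypothetical, chosen only to mirror the three kinds of fixes the abstract names (erase, regenerate, correct).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def plan_retouch(target_words, detected_words, threshold=0.6):
    """Decide which detected words to correct or erase, and which targets to insert.

    Assumption: a simple greedy one-to-one matching by string similarity;
    `threshold` is an illustrative cut-off, not a value from the paper.
    """
    # Score every (target, detected) pair and visit the best pairs first.
    pairs = sorted(
        (
            (similarity(t, d), ti, di)
            for ti, t in enumerate(target_words)
            for di, d in enumerate(detected_words)
        ),
        key=lambda p: -p[0],
    )

    used_targets, used_detected = set(), set()
    correct = []
    for score, ti, di in pairs:
        if score < threshold:
            break  # remaining pairs are too dissimilar to be the same word
        if ti in used_targets or di in used_detected:
            continue
        used_targets.add(ti)
        used_detected.add(di)
        # A matched pair that is not identical is a typo to be corrected.
        if target_words[ti] != detected_words[di]:
            correct.append((detected_words[di], target_words[ti]))

    # Detected words with no counterpart in the prompt are unintended text.
    erase = [d for di, d in enumerate(detected_words) if di not in used_detected]
    # Prompt words with no rendered counterpart still need a text box.
    insert = [t for ti, t in enumerate(target_words) if ti not in used_targets]
    return {"correct": correct, "erase": erase, "insert": insert}


if __name__ == "__main__":
    plan = plan_retouch(["GRAND", "OPENING", "SALE"], ["GRAND", "OPENNING", "xq7"])
    print(plan)
```

In this toy case, "OPENNING" is paired with "OPENING" as a typo to correct, "xq7" is flagged for erasing, and "SALE" is flagged as missing. A real pipeline would also have to carry the bounding boxes of the detected words so that the erasing and editing stages know where to operate.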

Paper Structure

This paper contains 43 sections, 2 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: Illustration of the Type-R pipeline. Type-R automatically detects errors, erases unintended texts, inserts missing words, and corrects spelling errors in the image.
  • Figure 2: Comparisons of generated images by Flux, Flux w/ Type-R, and TextDiffuser models. The left column shows prompts, and the right columns present generated images by each method.
  • Figure 3: Examples of naive baselines.
  • Figure 4: Plot of the relation between OCR accuracy and graphic design quality by GPT. Raw represents the results of text-to-image generation with prompt augmentation.
  • Figure 5: Examples of generated images through Type-R.
  • ...and 9 more figures