Type-R: Automatically Retouching Typos for Text-to-Image Generation
Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, Kota Yamaguchi
TL;DR
This work tackles the persistent challenge of rendering accurate, legible text in text-to-image generation. It introduces Type-R, a post-processing pipeline with four automatic stages (error detection, text erasing, layout regeneration, and typo correction) that leverages external OCR, vision-language models, and text-editing tools without retraining the base generator. In experiments on MARIO-Eval, Type-R consistently improves OCR-based text accuracy while maintaining or enhancing graphic design quality, and it outperforms typography-centric baselines in the trade-off between quality and accuracy. The method is plug-and-play across backbones such as Stable Diffusion and Flux, which makes it practical for applications that require reliable in-image text rendering, such as posters, ads, and signage.
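To make "OCR-based text accuracy" concrete, the following is a minimal sketch of a word-level precision, recall, and F1 score between the words requested in the prompt and the words an OCR engine reads from the generated image. It is an illustrative assumption, not the MARIO-Eval implementation: the function names `normalize` and `ocr_word_scores`, the case-insensitive alphanumeric normalization, and the multiset matching rule are all hypothetical choices made for this example.

```python
from collections import Counter


def normalize(word: str) -> str:
    """Lowercase a word and keep only alphanumeric characters (assumed rule)."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def ocr_word_scores(target_words, ocr_words):
    """Return (precision, recall, f1) for OCR output against target words.

    Words are compared as multisets, so a word requested twice must also
    be detected twice to count as fully matched.
    """
    target = Counter(w for w in map(normalize, target_words) if w)
    detected = Counter(w for w in map(normalize, ocr_words) if w)

    # Multiset intersection: per-word minimum of the two counts.
    matched = sum((target & detected).values())

    precision = matched / sum(detected.values()) if detected else 0.0
    recall = matched / sum(target.values()) if target else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, if the prompt asks for "Grand Opening" and the OCR engine reads "Grand Openning", one of two words matches on each side, so precision, recall, and F1 under this sketch are all 0.5.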
Abstract
While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in a post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
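The four steps described above can be sketched as a small orchestration function. This is a hedged illustration of the control flow only, not the authors' implementation: `TextRegion`, `Tools`, `retouch`, and the four callables (`ocr`, `erase`, `propose_box`, `render`) are hypothetical names that stand in for the external OCR, erasing, layout, and text-editing tools. Pairing erroneous regions with missing words in list order is a placeholder assumption; a real system would need a better matching rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass
class TextRegion:
    box: Box
    text: str


@dataclass
class Tools:
    """Hypothetical interfaces to the external tools the pipeline relies on."""
    ocr: Callable[[Any], List[TextRegion]]                    # detect and read text
    erase: Callable[[Any, Box], Any]                          # remove text in a box
    propose_box: Callable[[Any, str, List[TextRegion]], Box]  # place a missing word
    render: Callable[[Any, Box, str], Any]                    # rewrite text in a box


def retouch(image: Any, target_words: List[str], tools: Tools) -> Any:
    # Stage 1: error detection. Split OCR regions into correct and erroneous.
    remaining = list(target_words)
    correct: List[TextRegion] = []
    wrong: List[TextRegion] = []
    for region in tools.ocr(image):
        if region.text in remaining:
            remaining.remove(region.text)
            correct.append(region)
        else:
            wrong.append(region)

    # Stage 2: text erasing. Erroneous regions beyond the number of
    # missing words are surplus and get erased.
    reusable = wrong[:len(remaining)]
    for region in wrong[len(remaining):]:
        image = tools.erase(image, region.box)

    # Stage 3: layout regeneration. Missing words without a reusable
    # region receive a newly proposed text box.
    edits = [(region.box, word) for region, word in zip(reusable, remaining)]
    placed = correct + reusable
    for word in remaining[len(reusable):]:
        box = tools.propose_box(image, word, placed)
        placed.append(TextRegion(box, word))
        edits.append((box, word))

    # Stage 4: typo correction. Render the intended word into each box.
    for box, word in edits:
        image = tools.render(image, box, word)
    return image
```

Because the tools are passed in as plain callables, the same control flow can be exercised with any image generator's output, which mirrors the plug-and-play framing of the method; the base generator itself is never touched.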
