Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Peirong Zhang, Haowei Xu, Jiaxin Zhang, Guitao Xu, Xuhan Zheng, Zhenhua Yang, Junle Liu, Yuyi Zhang, Lianwen Jin

TL;DR

This work addresses the lack of systematic evaluation for OCR-focused text image generation and editing by benchmarking six state-of-the-art models across 33 tasks spanning documents, handwriting, scene text, artistic text, and complex layouts in English and Chinese. It reveals widespread limitations in text localization, content preservation, multilingual rendering, and layout integrity, with closed-source models like GPT-4o generally outperforming open-source ones. The authors argue that photorealistic text image generation and editing should be internalized as a foundational capability of general-domain generative models rather than relegated to specialized solutions, and they maintain the evaluation online in a continuously updated GitHub repository. The study offers insights for building more robust text image generation and editing abilities into general-purpose models, which in turn could benefit downstream OCR and information-extraction tasks.

Abstract

Text images are a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and group them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills of general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.

Paper Structure

This paper contains 38 sections, 21 figures, 1 table.

Figures (21)

  • Figure 1: Task categorization and subordination. We classify 33 OCR generative tasks into five primary categories based on text characteristics: document, handwritten text, scene text, artistic text, and complex & layout-rich (CLR) text. Each primary category encompasses multiple sub-tasks.
  • Figure 2: Dewarping results for modern documents. The red glow marks the relatively best output for each input. Overall, the dewarping results are poor: GPT-4o flattens the image but loses some embedded graphics and text, while the other methods fail to perform dewarping at all and lose substantial textual details.
  • Figure 3: Deshadowing and deblurring results for modern documents. Flux.1-Kontext-dev produces the best deshadowing results in both languages, with most text preserved and the shadow removed. Other models either fail to remove the shadow or mistakenly repeat the textual content. For document deblurring, Flux.1-Kontext-dev achieves nearly perfect results in the English case. GPT-4o performs better in the Chinese case, restoring the text precisely but failing to restore the document structure.
  • Figure 4: Appearance enhancement results for modern documents. Only GPT-4o and Flux.1-Kontext-dev successfully interpret the instruction to output "PDF-like" documents. Most models fail to preserve document structure, which is particularly evident in the erroneous repetition of table contents.
  • Figure 5: Text editing results for modern documents. GPT-4o and Qwen-VLo-preview follow the instructions better than other unified models, especially in locating the target text and replacing it. Flux.1-Kontext-dev significantly outperforms other specialized generation models.
  • ...and 16 more figures