Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Peirong Zhang; Haowei Xu; Jiaxin Zhang; Guitao Xu; Xuhan Zheng; Zhenhua Yang; Junle Liu; Yuyi Zhang; Lianwen Jin

TL;DR

This work addresses the lack of systematic evaluation for OCR-focused text image generation and editing by benchmarking six state-of-the-art models on 33 tasks spanning documents, handwritten text, scene text, artistic text, and complex, layout-rich text in English and Chinese. The evaluation reveals widespread limitations in text localization, content preservation, multilingual rendering, and layout integrity, with closed-source models such as GPT-4o generally outperforming open-source ones. The authors argue that photorealistic text image generation and editing should be internalized as a foundational capability of general-domain generative models rather than delegated to specialized solutions, and they maintain the evaluation online in a continuously updated GitHub repository. The findings are intended to guide the development of more robust text generation and editing abilities in general-purpose generative models.

Abstract

Text images are a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Because of their subtlety and complexity, generating text images remains a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., the Flux series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this question, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate a range of typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and group them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For a comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. From this evaluation, we draw crucial observations and identify the weaknesses of current generative models on OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills of general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis provides valuable insights that help the community achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.
