Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity

Kotaro Inoue

TL;DR

This study probes context-independent OCR with multimodal LLMs by testing single-character kanji recognition at four image resolutions and relating accuracy to image quality and character complexity. Starting from 2,136 kanji scored for complexity using fractal dimension and Shannon entropy, the study builds a 400-image evaluation set and compares GPT-4o, Gemini2.0-Flash, and Azure OCR. Multimodal LLMs match traditional OCR at about 300 ppi but lose accuracy below 150 ppi; a small subset of characters is repeatedly misread, and misrecognition correlates only weakly with visual complexity. The results imply that reliable character-level OCR with multimodal LLMs depends on image resolution and may require preprocessing or dedicated sub-models for high-precision tasks, which can guide practical deployment and future analyses of encoder capabilities.

Abstract

Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, for multimodal LLMs we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
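
The summary above states that visual complexity is scored from fractal dimension and Shannon entropy, but this page does not give the formulas or how the two are combined. As a rough illustration only, the sketch below computes two common versions of these quantities for a binarized character image: the Shannon entropy of the pixel-value histogram, and a box-counting estimate of the fractal dimension. The function names, the choice of box sizes, and the 2D-list image format are assumptions made for this example, not the paper's implementation.

```python
import math


def shannon_entropy(image):
    """Shannon entropy (in bits) of the pixel-value histogram of a 2D list."""
    counts = {}
    total = 0
    for row in image:
        for value in row:
            counts[value] = counts.get(value, 0) + 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def box_count(image, size):
    """Count size x size boxes that contain at least one nonzero (ink) pixel."""
    height, width = len(image), len(image[0])
    occupied = 0
    for top in range(0, height, size):
        for left in range(0, width, size):
            if any(
                image[y][x]
                for y in range(top, min(top + size, height))
                for x in range(left, min(left + size, width))
            ):
                occupied += 1
    return occupied


def box_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Estimate fractal dimension as the least-squares slope of
    log N(s) against log(1/s), where N(s) is the occupied-box count.

    Assumed box sizes (1, 2, 4, 8) are illustrative; the image must
    contain at least one nonzero pixel.
    """
    box_counts = [box_count(image, s) for s in sizes]
    if min(box_counts) == 0:
        raise ValueError("image has no foreground pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(n) for n in box_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

On a 16 x 16 image, a fully filled square yields a dimension estimate close to 2 and a single filled row yields an estimate close to 1, which is a quick sanity check on the slope fit. Any weighting that turns these two numbers into a single complexity score would be a further assumption, since the combination rule is not stated here.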

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Examples of the evaluation kanji dataset
  • Figure 2: Comparison of OCR accuracy at each resolution
  • Figure 3: Relationship between visual complexity and misrecognition frequency for each model
  • Figure 4: The complete evaluation kanji dataset