Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

TL;DR

The paper tackles OCR in dynamic video environments by benchmarking Vision-Language Models (Claude-3, Gemini-1.5, GPT-4o) against traditional OCR systems (EasyOCR, RapidOCR) on a newly released 1,477-frame dataset spanning diverse domains. It introduces an open-source benchmarking pipeline via VideoDB and evaluates using Word Error Rate (WER), Character Error Rate (CER), and Accuracy, revealing that VLMs often outperform traditional OCR but can hallucinate or trigger content-security policies. Among the VLMs, GPT-4o achieves the highest overall accuracy and Gemini-1.5 Pro yields the lowest WER, while processing speed varies across models. All models, however, still struggle with occluded or stylized text. The work demonstrates the practical potential of VLMs for video OCR and provides publicly available resources to accelerate research and robust evaluation in multimodal OCR tasks.

Abstract

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
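
To make the error-rate metrics concrete, the following is a minimal sketch of WER and CER under their conventional definition: the Levenshtein edit distance (substitutions, insertions, deletions) between reference and predicted text, divided by the reference length in words or characters. The function names are hypothetical, and the whitespace tokenization and lack of text normalization are assumptions. The benchmark's own preprocessing and its exact definition of Accuracy are not stated in this summary, so Accuracy is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()  # assumption: plain whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "breaking news at nine"   # made-up annotation, not from the dataset
    predicted = "breaking mews at"    # made-up model output
    print(f"WER = {wer(truth, predicted):.2f}")
    print(f"CER = {cer(truth, predicted):.2f}")
```

Because insertions count as errors, both rates can exceed 1.0 when a model emits far more text than the frame contains, which is one way hallucinated output would surface in these metrics.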

Paper Structure

This paper contains 15 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Handwritten and Occluded text (Additions and Substitutions are marked in Red)
  • Figure 2: TV Commercial (Additions and Substitutions are marked in Red)
  • Figure 3: Finance/Business/News Text
  • Figure 4: Handwritten Text
  • Figure 5: Legal/Educational Text
  • ...and 8 more figures