Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems

Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal

TL;DR

To address production-scale OCR for India's linguistic diversity and document heterogeneity, the paper compares two Vision-Language Model strategies: training a LLaVA-style multilingual OCR model end-to-end (Chitrapathak-1) and fine-tuning an existing OCR-specialized model (Chitrapathak-2). It also introduces Parichay, a domain-specific OCR series for structured extraction from government documents. On multilingual Indic benchmarks, fine-tuning outperforms end-to-end training: Chitrapathak-2 is $3$-$6\times$ faster than Chitrapathak-1 and reaches $6.69$ character ANLS on Telugu, while Parichay attains $89.8\%$ exact match with fast inference (~1.03 s/document). Together, these results offer practical guidance on when to favor general VLM adaptation versus task-specific fine-tuning, with actionable deployment lessons for production-scale OCR pipelines in India.

Abstract

Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained on the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the remaining languages. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
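To make the Exact Match figure concrete, the following is a minimal sketch of one plausible way to score structured key-field extraction. It is an assumption, not the paper's evaluation code: the field names (`name`, `dob`), the whitespace and case normalization, and the choice to score per field rather than per document are all hypothetical.

```python
from typing import Mapping, Sequence


def normalize(value: str) -> str:
    """Collapse runs of whitespace and ignore case (an assumed normalization)."""
    return " ".join(value.split()).casefold()


def exact_match_rate(
    predictions: Sequence[Mapping[str, str]],
    references: Sequence[Mapping[str, str]],
) -> float:
    """Fraction of reference fields whose predicted value matches exactly.

    A field missing from the prediction counts as a miss.
    """
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            if normalize(pred.get(key, "")) == normalize(expected):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical example: two documents with two key fields each.
references = [
    {"name": "Asha Rao", "dob": "01/01/1990"},
    {"name": "R. Kumar", "dob": "15/08/1985"},
]
predictions = [
    {"name": "asha  rao", "dob": "01/01/1990"},  # matches after normalization
    {"name": "R. Kumar", "dob": "15/08/1986"},   # one wrong digit in dob
]
score = exact_match_rate(predictions, references)
```

Under these assumptions, three of the four fields match, so `score` is 0.75. A stricter document-level variant would count a document as correct only when every field matches.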

Paper Structure (27 sections, 1 equation, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Overview of the two complementary strategies explored in our work (Section \ref{sec:chitrapathak}). We find that strategy 2, fine-tuning an existing model, is data-efficient and performs better for multilingual and domain adaptation.
  • Figure 2: OCR outputs for Hindi (left) and Sanskrit (right) languages from Chitrapathak-2. More examples in Appendix.
  • Figure 3: Example document image from the SROIE dataset.
  • Figure 4: OCR outputs for Odia (left) and Malayalam (right) languages from Chitrapathak-2.
  • Figure 5: Examples of limitations of Chitrapathak-2. The left image shows an index-page in Hindi while the right image shows a page in English with a rare/old way of writing the letter 's' which the model consistently reads as 'f'.