Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach
S M Rakib Hasan, Aakar Dhakal, Md Humaion Kabir Mehedi, Annajiat Alim Rasel
TL;DR
This work addresses OCR for Bengali and Nepali in low-resource settings using a Transformer-based pipeline (TrOCR) that pairs a ViT encoder with a multilingual xlm-roberta-base decoder to handle both handwritten and printed text. The model is trained on language-specific data, notably the BanglaWriting Bengali dataset and a small Nepali dataset. On the training data it reaches a Character Error Rate (CER) of 0.04 (Bengali) and 0.10 (Nepali) and a Word Error Rate (WER) of 0.10 (Bengali) and 0.14 (Nepali); on the test data the corresponding figures are CER 0.07 (Bengali) and 0.11 (Nepali) and WER 0.12 (Bengali) and 0.15 (Nepali). The results demonstrate the feasibility and effectiveness of transformer-based OCR for South Asian scripts, supporting digitization and computational linguistics in the region. The approach highlights the potential for scalable OCR in low-resource languages and paves the way for further improvement through larger datasets and domain adaptation.
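To make the reported metrics concrete, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the predicted text, divided by the reference length in characters or words. The function names `edit_distance`, `cer`, and `wer` are illustrative and not taken from the paper, and the authors' actual evaluation code may differ. In particular, this sketch counts Unicode code points as characters, which is an assumption; for Bengali and Devanagari text a grapheme-cluster-based count would give different values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate, counting Unicode code points (an assumption)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical strings, not data from the paper.
    print(cer("kitten", "sitting"))
    print(wer("the cat sat", "the cat sit"))
```

Because both metrics divide by the reference length, they can exceed 1.0 when the prediction contains many insertions, so they are error rates rather than bounded accuracies.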
Abstract
Research and development of OCR systems for low-resource languages is relatively recent. Low-resource languages have little data available for training machine translation or other systems. Even though a vast amount of text has been digitized and made available on the internet, the text is still in PDF and image formats, which are not readily accessible as machine-readable text. This paper discusses text recognition for two languages, Bengali and Nepali, which have about 300 million and 40 million speakers respectively. In this study, an encoder-decoder transformer model was developed, and its efficacy was assessed on a collection of text images, both handwritten and printed. The results indicate that the proposed technique is comparable with current approaches and achieves high precision in recognizing text in Bengali and Nepali. This study can pave the way for advanced and accessible study of linguistics in South Asia.
