Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation
Filipe Lauar, Valentin Laurent
TL;DR
This work tackles cross-language OCR for Visual Rich Documents (VRDs) by adapting TrOCR to Spanish through transfer learning, addressing data scarcity with a large synthetic VRD image-text generator and artifact-aware augmentations. It compares two transfer strategies: (i) an English TrOCR encoder paired with a Spanish decoder, and (ii) fine-tuning the English TrOCR base model on Spanish data, both with comparable parameter budgets. Evaluation on the Spanish subset of XFUND using character error rate (CER) and word error rate (WER) shows that fine-tuning the English TrOCR on Spanish outperforms the Spanish-decoder variant, achieving strong open-source performance and approaching cloud-based baselines. The authors release the Spanish TrOCR models and the dataset generator code, establishing a reproducible pipeline for high-quality, multilingual OCR in VRDs.
Abstract
This study explores the transfer learning capabilities of the TrOCR architecture to Spanish. TrOCR is a transformer-based Optical Character Recognition (OCR) model renowned for its state-of-the-art performance on English benchmarks. Inspired by Li et al.'s assertion regarding its adaptability to multilingual text recognition, we investigate two distinct approaches to adapting the model to a new language: combining an English TrOCR encoder with a language-specific decoder and training the resulting model on the target language, and fine-tuning the English base TrOCR model on data in the new language. Due to the scarcity of publicly available datasets, we present a resource-efficient pipeline for creating OCR datasets in any language, along with a comprehensive benchmark of the different image generation methods employed, with a focus on Visual Rich Documents (VRDs). Additionally, we offer a comparative analysis of the two approaches for the Spanish language, demonstrating that, for a fixed dataset size, fine-tuning the English TrOCR on Spanish yields better recognition than the language-specific decoder. We evaluate our model using character and word error rate metrics on a publicly available printed dataset, comparing its performance against other open-source and cloud OCR models for Spanish. To the best of our knowledge, the resulting models are the best open-source models for OCR in Spanish. The Spanish TrOCR models are publicly available on HuggingFace [20] and the code to generate the dataset is available on GitHub [25].
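For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of character and word error rate, assuming the standard definitions (Levenshtein edit distance divided by the reference length). The function names, the whitespace tokenization, and the absence of any text normalization are assumptions made for illustration; the abstract does not specify how the metrics are implemented.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words.

    Assumption: words are split on whitespace, with no normalization.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # A dropped accent is one character substitution and one wrong word.
    print(cer("factura número", "factura numero"))
    print(wer("factura número", "factura numero"))
```

Because both metrics divide by the reference length, a single missing accent barely moves CER but counts as a full word error in WER, which is one reason the two are usually reported together for accented languages such as Spanish.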
