Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation

Filipe Lauar, Valentin Laurent

TL;DR

This work tackles cross-language OCR for Visually Rich Documents (VRDs) by adapting TrOCR to Spanish through transfer learning. It addresses data scarcity with a large synthetic VRD image-text generator and artifact-aware augmentations. It compares two transfer strategies with comparable parameter budgets: (i) pairing the English TrOCR encoder with a Spanish decoder, and (ii) fine-tuning the English TrOCR base model on Spanish data. Evaluation on XFUND Spanish using character error rate ($CER$) and word error rate ($WER$) shows that fine-tuning the English TrOCR on Spanish outperforms the Spanish-decoder variant, achieving strong open-source performance and approaching cloud-based baselines. The authors release the Spanish TrOCR models and the dataset generator code, providing a reproducible pipeline for OCR in VRDs that can be applied to other languages.

Abstract

This study explores the transfer learning capabilities of the TrOCR architecture to Spanish. TrOCR is a transformer-based Optical Character Recognition (OCR) model known for its state-of-the-art performance on English benchmarks. Inspired by Li et al.'s assertion regarding its adaptability to multilingual text recognition, we investigate two distinct approaches to adapting the model to a new language: pairing an English TrOCR encoder with a language-specific decoder and training the model on that language, and fine-tuning the English base TrOCR model on data in the new language. Due to the scarcity of publicly available datasets, we present a resource-efficient pipeline for creating OCR datasets in any language, along with a comprehensive benchmark of the different image generation methods employed, with a focus on Visually Rich Documents (VRDs). Additionally, we offer a comparative analysis of the two approaches for the Spanish language, demonstrating that fine-tuning the English TrOCR on Spanish yields better recognition than the language-specific decoder for a fixed dataset size. We evaluate our model using character and word error rate metrics on a publicly available printed dataset, comparing its performance against other open-source and cloud OCR models for Spanish. As far as we know, these are the best open-source models for OCR in Spanish. The Spanish TrOCR models are publicly available on HuggingFace [20] and the code to generate the dataset is available on GitHub [25].
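
As an illustration of the two metrics named above, the following is a minimal sketch of character and word error rate using the standard definition: edit distance between prediction and reference, divided by the reference length. The function names (`edit_distance`, `cer`, `wer`) are hypothetical, and the text normalization used by the authors (casing, punctuation, whitespace handling) is not stated in this summary, so none is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words.

    Assumption: words are separated by whitespace; no other normalization.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("casa", "cosa"))  # 0.25 (one substitution over four characters)
    print(wer("la casa roja", "la cosa roja"))  # one wrong word out of three
```

Both metrics can exceed 1.0 when the prediction contains many insertions, since the denominator is the reference length only.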

Paper Structure (18 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Boxes.
  • Figure 2: Random horizontal and/or vertical lines.
  • Figure 3: Cropped text.
  • Figure 4: A few samples from the generated dataset after data augmentation.
  • Figure 5: CER values achieved by the small, base, and large versions of the English TrOCR checkpoint. Each data point represents the mean value over more than a hundred images of varying sentence lengths (measured in number of characters). The error values were computed under the assumption that the CER at each point follows a Gaussian distribution.
  • ...and 1 more figures