Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation

Filipe Lauar; Valentin Laurent

TL;DR

This work tackles cross-language OCR for Visual Rich Documents (VRDs) by adapting TrOCR to Spanish through transfer learning. It addresses data scarcity with a large synthetic VRD image-text generator and artifact-aware augmentations. It compares two transfer strategies with comparable parameter budgets: (i) pairing the English TrOCR encoder with a Spanish decoder, and (ii) fine-tuning the English TrOCR base model on Spanish data. Evaluation on the XFUND Spanish dataset using character error rate (CER) and word error rate (WER) shows that fine-tuning the English TrOCR on Spanish outperforms the Spanish-decoder variant, achieving strong open-source performance and approaching cloud-based baselines. The authors release the Spanish TrOCR models and the dataset generator code, providing a reproducible pipeline for adapting OCR on VRDs to other languages.

Abstract

This study explores the transfer learning capabilities of the TrOCR architecture for Spanish. TrOCR is a transformer-based Optical Character Recognition (OCR) model known for its state-of-the-art performance on English benchmarks. Inspired by Li et al.'s assertion regarding its adaptability to multilingual text recognition, we investigate two distinct approaches to adapting the model to a new language: combining an English TrOCR encoder with a language-specific decoder and training the model on that language, and fine-tuning the English base TrOCR model on data in the new language. Because publicly available datasets are scarce, we present a resource-efficient pipeline for creating OCR datasets in any language, along with a comprehensive benchmark of the image generation methods employed, with a focus on Visual Rich Documents (VRDs). Additionally, we offer a comparative analysis of the two approaches for Spanish, demonstrating that fine-tuning the English TrOCR on Spanish yields better recognition than the language-specific decoder for a fixed dataset size. We evaluate our model using character and word error rate metrics on a publicly available printed dataset, comparing its performance against other open-source and cloud OCR models for Spanish. As far as we know, these resources represent the best open-source model for OCR in Spanish. The Spanish TrOCR models are publicly available on HuggingFace [20] and the code to generate the dataset is available on GitHub [25].
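The character and word error rates mentioned above can be illustrated with a minimal sketch. It assumes the standard edit-distance definitions (substitutions, deletions, and insertions divided by the reference length); the exact formulas used in the paper are not reproduced on this page, and the function names `edit_distance`, `cer`, and `wer` are hypothetical helpers, not part of the released code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # A missing tilde is one character substitution, but a full word error.
    print(cer("el niño come", "el nino come"))
    print(wer("el niño come", "el nino come"))
```

A single wrong diacritic counts as one character error out of twelve but one word error out of three, which is why the two metrics are usually reported together for languages such as Spanish.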

Paper Structure (18 sections, 2 equations, 6 figures, 3 tables)

Introduction
Related Work
Models
Datasets
Methodology
VRDs image-text generation dataset
Models
Results
Image generation augmentation benchmarking
English fine-tuning vs. Spanish-decoder fine-tuning
Comprehensive vs. No Augmentation
Elastic deformation vs. No elastic deformation
Artifacts vs. No artifacts
Handwritten generation model: with vs. without
XFUND Spanish dataset
...and 3 more sections

Figures (6)

Figure 1: Boxes.
Figure 2: Random horizontal and/or vertical lines.
Figure 3: Cropped text.
Figure 4: Few samples of the generated dataset, after the use of data augmentation.
Figure 5: CER values achieved by the small, base, and large versions of the English TrOCR checkpoint. Each data point represents the mean value over more than a hundred images of varying sentence lengths (measured in number of characters). The error values were computed under the assumption that the CER distribution for each point follows a Gaussian distribution.
...and 1 more figures
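As a rough illustration of the line artifact named in Figure 2, the sketch below draws random horizontal and/or vertical lines on a grayscale image. It is an assumption-laden toy, not the authors' generator: the image is represented as a plain list of pixel rows, and the function name `add_random_lines` and its parameters `max_lines`, `ink`, and `rng` are hypothetical.

```python
import random


def add_random_lines(image, max_lines=3, ink=0, rng=None):
    """Return a copy of a grayscale image with random full-length lines.

    `image` is a list of rows, each a list of pixel values in 0-255.
    Each line is horizontal or vertical with equal probability.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched
    for _ in range(rng.randint(1, max_lines)):
        if rng.random() < 0.5:
            y = rng.randrange(height)
            out[y] = [ink] * width   # horizontal line across the row
        else:
            x = rng.randrange(width)
            for row in out:
                row[x] = ink         # vertical line down the column
    return out


if __name__ == "__main__":
    blank = [[255] * 8 for _ in range(4)]  # white 8x4 canvas
    noisy = add_random_lines(blank, rng=random.Random(0))
    for row in noisy:
        print("".join("#" if p == 0 else "." for p in row))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs with and without artifacts.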
