Mixed Text Recognition with Efficient Parameter Fine-Tuning and Transformer

Da Chang, Yu Li

TL;DR

This work tackles mixed-text OCR, where handwritten, printed, and scene text must be recognized with limited computational resources. It introduces DLoRA-TrOCR, a parameter-efficient hybrid text spotting model that applies weight-decomposed DoRA to the image encoder and LoRA to the text decoder, so that fine-tuning updates no more than 0.7% of the parameters while maintaining high accuracy. Evaluations on IAM, SROIE, STR, and a constructed mixed dataset show strong results, including IAM CER $4.02$, SROIE F1 $94.29$, STR WAR $86.70$, and mixed-dataset WAR $88.07$, demonstrating improved generalization and efficiency over full fine-tuning and other PEFT methods. The findings highlight the value of selectively applying PEFT to the encoder and decoder components in Transformer-based OCR, offering practical guidance for deploying large-scale OCR models in multi-scene contexts.

Abstract

With the rapid development of OCR technology, mixed-scene text recognition has become a key technical challenge. Although deep learning models have achieved significant results in specific scenarios, their generality and stability still need improvement, and their high demand for computing resources limits flexibility. To address these issues, this paper proposes DLoRA-TrOCR, a parameter-efficient hybrid text spotting method based on a pre-trained OCR Transformer. By embedding a weight-decomposed DoRA module in the image encoder and a LoRA module in the text decoder, the method can be fine-tuned efficiently on various downstream tasks. It requires no more than 0.7% trainable parameters, which improves training efficiency and also significantly improves the recognition accuracy and cross-dataset generalization of the OCR system in mixed-text scenes. Experiments show that DLoRA-TrOCR outperforms other parameter-efficient fine-tuning methods in recognizing complex scenes with mixed handwritten, printed, and street text, achieving a CER of 4.02 on the IAM dataset, an F1 score of 94.29 on the SROIE dataset, and a WAR of 86.70 on the STR Benchmark, reaching state-of-the-art performance.
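
The two adapter types named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the commonly used formulations: LoRA adds a scaled low-rank product to a frozen weight, and DoRA applies that same update to the direction of the weight and then rescales each column by a learned magnitude. The layer sizes, the rank, the `alpha / r` scaling, and the function names are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def lora_weight(W0, A, B, alpha):
    """LoRA (assumed form): W = W0 + (alpha / r) * B @ A, with W0 frozen."""
    r = A.shape[0]
    return W0 + (alpha / r) * (B @ A)

def dora_weight(W0, A, B, m, alpha):
    """DoRA (assumed form): normalize the LoRA-updated weight column-wise
    to get a direction, then rescale each column by the magnitude m."""
    V = lora_weight(W0, A, B, alpha)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * (V / col_norm)

# Hypothetical layer sizes and hyperparameters, chosen only for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 8, 16

W0 = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))      # trainable low-rank factor
B = np.zeros((d_out, r))                        # zero init: no change at step 0
m = np.linalg.norm(W0, axis=0, keepdims=True)   # magnitude init: column norms of W0

W_lora = lora_weight(W0, A, B, alpha)
W_dora = dora_weight(W0, A, B, m, alpha)

# Trainable parameters per adapted layer versus the frozen weight.
lora_params = A.size + B.size
dora_params = lora_params + m.size
print(lora_params, dora_params, W0.size)
```

With `B` initialized to zero, both adapted weights equal the frozen weight at the start of training, and only the small factors (plus the magnitude vector for DoRA) would receive gradients. This is the mechanism behind a small trainable-parameter fraction; the actual fraction depends on which layers are adapted and on the chosen rank.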

Paper Structure

This paper contains 17 sections, 4 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Optical Character Recognition: Text Detection and Text Recognition.
  • Figure 2: (a) The Transformer-based architecture: an encoder-decoder model with a pre-trained image Transformer as the encoder and a pre-trained text Transformer as the decoder. The model is pre-trained in two stages on a synthetic dataset containing millions of handwritten and printed text samples. The overall framework is based on TrOCR (Li et al., 2023). (b) and (c) Schematic diagrams of the DoRA and LoRA methods, respectively. DoRA decomposes the pre-trained weights into direction and magnitude, applies a LoRA update to the direction component, and then rescales the result by the magnitude. LoRA approximates the weight update with a product of low-rank matrices.
  • Figure 3: Heatmaps over rank values of 1, 2, 4, 8, and 16 and alpha values of 1, 2, 4, 8, 16, and 32. In (a), the color intensity represents the CER on the validation set; a deeper blue indicates better performance (lower CER). In (b), the color reflects the F1 score on the validation set; a more intense red indicates better performance (higher F1).
  • Figure 4: The impact of applying DoRA, LoRA, or fine-tuning (FT) to the encoder and to the decoder of our model. For comparison purposes, the metric used here is the Character Accuracy Rate, defined as $(1-\mathrm{CER}) \times 100\%$.
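
The Character Accuracy Rate used in Figure 4 can be computed as in the sketch below. It assumes the standard definition of CER as the character-level edit (Levenshtein) distance divided by the reference length; the text above gives only the conversion from CER to Character Accuracy Rate, so the CER definition and the function names are assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        curr = [i]
        for j, hc in enumerate(hyp, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def character_accuracy_rate(ref: str, hyp: str) -> float:
    """Character Accuracy Rate in percent: (1 - CER) * 100."""
    return (1.0 - cer(ref, hyp)) * 100.0

# One substituted character out of five.
print(cer("hello", "hallo"), character_accuracy_rate("hello", "hallo"))
```

Because CER is normalized by the reference length, a hypothesis much longer than the reference can push CER above 1 and the accuracy rate below zero; corpus-level scores are often computed by summing edit distances and reference lengths over all samples before dividing.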