JSTR: Judgment Improves Scene Text Recognition

Masato Fujitake

TL;DR

JSTR tackles scene text recognition by adding a judgment mechanism that assesses whether an image-text pair matches, alongside standard recognition. It builds on the DTrOCR baseline and adds a correct/incorrect judgment task that uses the recognizer's own outputs as misrecognition examples. The model is trained in two steps: first for text recognition, then for correctness judgment, so that it learns its own error patterns and discriminates better on hard cases. Experiments on six benchmarks show improved word-level accuracy over the baseline and performance competitive with state-of-the-art methods, with larger gains when the model is trained on real-world data, which suggests practical robustness.
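
The word-level accuracy mentioned above can be illustrated with a minimal sketch. It assumes a strict exact-match comparison between each predicted word and its ground-truth word; any case or punctuation normalization applied by the benchmarks is not described here, so none is applied. The function name `word_accuracy` and the toy strings are hypothetical and only serve the illustration.

```python
from typing import Sequence


def word_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of words whose predicted string equals the reference exactly.

    Assumption: strict string equality, with no case folding or symbol
    filtering. Real benchmark protocols may normalize text differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    # A word counts as correct only if every character matches.
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


# Toy data (hypothetical): one of four words is misrecognized.
preds = ["coffee", "c1osed", "exit", "open"]
refs = ["coffee", "closed", "exit", "open"]
acc = word_accuracy(preds, refs)
print(f"{acc:.2%}")  # prints 75.00%
```

Because a single wrong character makes the whole word count as an error, this metric is sensitive to exactly the kind of near-miss predictions that a judgment task could help the model tell apart.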

Abstract

In this paper, we present a method that improves scene text recognition accuracy by judging whether an image and a text match. Whereas previous studies focused on generating recognition results from input images, our approach also uses the model's own misrecognition results to capture its error tendencies, thereby improving the text recognition pipeline. By predicting whether each image-text pair is correct or incorrect, the method gives the model explicit feedback on the data it is likely to misrecognize, which boosts recognition accuracy. Experimental results on publicly available datasets demonstrate that the proposed method outperforms the baseline and state-of-the-art methods in scene text recognition.

Paper Structure (11 sections, 3 figures, 2 tables)

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Differences between previous approaches and the proposed method. In previous work, a model is built to output recognition results from an image. The proposed method additionally trains the same model to judge whether an image and a text match. If the image and text content match, the model judges the pair as correct; if not, it judges the pair as incorrect. This improves recognition accuracy by strengthening the connection between image and text.
  • Figure 2: The flow of the data pipeline in the proposed method. The left column shows the input, model, output, and ground truth during training. The labels Image, Recognition, GT, Bool, True, False, and Miss-recognition denote the image, the text recognition result, the ground-truth text, the Boolean judgment result, true, false, and the misrecognized text, respectively.
  • Figure 3: Comparison of recognition results between the baseline (DTrOCR; Fujitake, 2023) and the proposed method. The leftmost column is the input image, and each remaining column shows the recognition results of one method. Black characters match the ground truth, and red characters indicate misrecognition. The comparison with the baseline suggests that the proposed method is more robust in hard-to-read cases because it learns misrecognition tendencies.
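
The data flow described for Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `JudgmentSample`, `build_judgment_samples`, and `toy_recognizer` are hypothetical, images are represented by string identifiers instead of pixel data, and the rule "pair the ground truth with True and a differing recognition result with False" is one plausible reading of the caption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class JudgmentSample:
    """One (image, text, label) example for the correct/incorrect judgment task."""
    image_id: str  # stand-in for the image itself (assumption for this sketch)
    text: str      # candidate text paired with the image
    label: bool    # True if the text matches the ground truth, else False


def build_judgment_samples(
    dataset: Iterable[Tuple[str, str]],
    recognize: Callable[[str], str],
) -> List[JudgmentSample]:
    """Build judgment examples from a recognizer's outputs.

    `dataset` yields (image_id, ground_truth) pairs, and `recognize` is the
    recognizer obtained after the first training step (text recognition).
    """
    samples: List[JudgmentSample] = []
    for image_id, ground_truth in dataset:
        # The ground-truth pair is a positive ("True") example.
        samples.append(JudgmentSample(image_id, ground_truth, True))
        prediction = recognize(image_id)
        # A prediction that differs from the ground truth becomes a
        # negative ("False") example, i.e. a misrecognition sample.
        if prediction != ground_truth:
            samples.append(JudgmentSample(image_id, prediction, False))
    return samples


def toy_recognizer(image_id: str) -> str:
    """Hypothetical recognizer with canned outputs; it misreads 'closed'."""
    canned = {"img_0": "coffee", "img_1": "c1osed"}
    return canned[image_id]


toy_dataset = [("img_0", "coffee"), ("img_1", "closed")]
samples = build_judgment_samples(toy_dataset, toy_recognizer)
for sample in samples:
    print(sample)
```

In this sketch the negatives come from the model's own mistakes instead of random corruptions, so the second training step would see exactly the confusions the recognizer tends to make.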