NCAP: Scene Text Image Super-Resolution with Non-CAtegorical Prior

Dongwoo Park, Suk Pil Ko

TL;DR

This work tackles two problems in scene text image super-resolution (STISR): instability caused by explicit text priors, and the overconfidence that arises when recognizers are jointly trained with the STISR network to bridge the domain gap between low- and high-resolution images. It introduces Non-CAtegorical Prior (NCAP), which uses penultimate-layer representations processed by adapters as a category-free, information-rich prior, incurring only about 0.3% extra parameters. It mitigates overconfidence by mixing hard ground-truth labels with soft teacher labels, combining a cross-entropy loss with a temperature-scaled KL divergence: $\mathcal{L}=(1-\alpha)\mathcal{L}_{CE} + \alpha\mathcal{L}_{KL}(p^s(\tau),p^t(\tau))$, where $\tau$ controls the sharpness of the distributions. Experiments on TextZoom show a $3.5\%$ improvement, and cross-dataset STR evaluation shows a $14.8\%$ generalization gain. The method applies to TP-guided STISR models in general.
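The mixed hard/soft label loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `softmax` and `mixed_label_loss`, the default values of `alpha` and `tau`, the per-character averaging, and the KL direction (teacher relative to student) are all assumptions, since the summary only gives the overall form of $\mathcal{L}$.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over one list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_label_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=2.0):
    """L = (1 - alpha) * CE + alpha * KL(p_t(tau) || p_s(tau)).

    student_logits, teacher_logits: one list of class logits per character.
    targets: ground-truth class index per character (the hard labels).
    alpha, tau: hypothetical defaults chosen only for illustration.
    """
    ce_sum, kl_sum = 0.0, 0.0
    for s, t, y in zip(student_logits, teacher_logits, targets):
        # Hard-label term: cross-entropy at temperature 1.
        ce_sum += -math.log(softmax(s)[y])
        # Soft-label term: KL divergence between softened distributions.
        p_s = softmax(s, tau)
        p_t = softmax(t, tau)
        kl_sum += sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    n = len(targets)
    return (1 - alpha) * ce_sum / n + alpha * kl_sum / n

# Two characters, three classes each (toy values).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
teacher = [[1.5, 0.7, -0.5], [0.0, 0.3, 2.0]]
labels = [0, 2]
loss = mixed_label_loss(student, teacher, labels, alpha=0.5, tau=2.0)
```

Setting `alpha=0` recovers plain cross-entropy on hard labels, and `alpha=1` recovers pure distillation from soft labels. A larger `tau` flattens both distributions, so the soft-label term penalizes confident mismatches less sharply.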

Abstract

Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with Non-CAtegorical Prior (NCAP) using penultimate layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5%, and our method enhances generalization performance by 14.8% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Examples illustrating the negative impact of prior knowledge on an STISR task. Panels (a), (b), and (c), each shown with and without our method, use CRNN [shi2016end] as the prior generator (pre-trained text recognizer) in TATT [ma2022text], PARSeq [bautista2022scene] as the prior generator in TATT [ma2022text], and LEMMA [guo2023towards], respectively. The wrong guidance of the explicit prior appears across different STISR networks and prior generators. Blue indicates characters in the STISR results that can be influenced by prior knowledge. Red indicates wrong recognition results. Without our method, prior knowledge negatively influences the STISR results; with our method, this negative influence is effectively eliminated. Prior refers to the argmaxed text prior, and SR refers to the recognition result of the SR image.
  • Figure 2: Overall architecture. We enhance the previous TP-guided STISR network by introducing a loss function that incorporates linear combinations of hard labels and soft labels, along with NCAP, which utilizes penultimate layer representations as prior knowledge.
  • Figure 3: Word- and character-level reliability diagrams. LEMMA [guo2023towards] is visualized with the pre-trained weights from the official code; LEMMA$^{*}$ is the model re-trained with the official LEMMA code; LS [he2015delving] is a label smoothing technique; and Distillation is a model trained on soft labels only, with no hard-label term in the loss. Ours is a model trained with a linear combination of hard- and soft-label losses. Please refer to the supplementary materials for the calculation of character-level reliability.
  • Figure 4: Character-level distribution difference of the text logits on training data for each loss function. (a) is a linear combination of softened KL divergence loss and cross-entropy loss, (b) is KL divergence loss together with MAE loss, as used in TPGSR [ma2023text] and TATT [ma2022text], and (c) is the cross-entropy loss used in LEMMA [guo2023towards].
  • Figure 5: Visualization of SR images and recognition results on TextZoom [wang2020scene] by CRNN [shi2016end]. Red indicates wrong recognition results.
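
The reliability diagrams in Figure 3 compare a recognizer's confidence with its actual accuracy. As a rough sketch of how such a diagram is commonly built, the code below groups predictions into equal-width confidence bins and reports the mean confidence and accuracy per bin. This is a generic procedure, not the paper's character-level calculation (which is given in its supplementary materials). The function names, the 10-bin default, and the expected calibration error summary are assumptions added for illustration.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean_confidence, accuracy, count) for each equal-width bin."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [conf_sum, hits, count]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        sums[idx][0] += c
        sums[idx][1] += 1.0 if ok else 0.0
        sums[idx][2] += 1
    rows = []
    for conf_sum, hits, count in sums:
        if count:
            rows.append((conf_sum / count, hits / count, count))
        else:
            rows.append((None, None, 0))  # empty bin: nothing to plot
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted mean gap between accuracy and confidence over the bins."""
    n = len(confidences)
    rows = reliability_bins(confidences, correct, n_bins)
    return sum(count / n * abs(acc - conf) for conf, acc, count in rows if count)

# Toy predictions: max softmax probability and whether the prediction was right.
conf = [0.95, 0.95, 0.55, 0.55]
hit = [True, True, True, False]
rows = reliability_bins(conf, hit)
ece = expected_calibration_error(conf, hit)
```

A perfectly calibrated model has accuracy equal to mean confidence in every bin. An overconfident model, the failure mode the paper attributes to joint training on hard labels alone, shows accuracy below confidence in the high-confidence bins.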