TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

Keren Ye, Ignacio Garcia Dorado, Michalis Raptis, Mauricio Delbracio, Irene Zhu, Peyman Milanfar, Hossein Talebi

TL;DR

TextSR addresses a key gap in diffusion-based scene text super-resolution by integrating OCR-derived multilingual priors into a region-focused diffusion framework. It encodes the OCR-recognized text as UTF-8 bytes with ByT5 and fuses these priors with low-resolution image features through cross-attention, so a single model can handle multiple languages. The paper introduces two strategies for robustness to OCR errors: classifier-free dual-condition guidance with an adjustable weight, and iterative OCR conditioning, which progressively refines both text recognition and restoration. Trained on 18 million text crops across seven datasets, TextSR achieves state-of-the-art results on TextZoom and TextVQA, improving text fidelity and legibility under OCR noise, with practical impact for downstream tasks such as visual question answering on multilingual scenes.
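To make the "text as UTF-8 bytes" idea concrete, here is a minimal sketch of byte-level tokenization. It is not the paper's code: the function names are hypothetical, and the offset of 3 reserved ids with an end-of-sequence id of 1 follows the ByT5 tokenizer convention as an assumption. The point it illustrates is that any script maps onto the same 256-symbol vocabulary, which is what lets one encoder serve many languages.

```python
from typing import List

# Assumption: ids 0..2 are reserved (pad, eos, unk), so byte b maps to id b + 3.
BYTE_OFFSET = 3
EOS_ID = 1


def encode_utf8_bytes(text: str) -> List[int]:
    """Return one token id per UTF-8 byte of `text`, followed by an EOS id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_utf8_bytes(ids: List[int]) -> str:
    """Invert encode_utf8_bytes, skipping reserved ids such as EOS."""
    data = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Latin, Chinese, and Hindi samples share one vocabulary of byte ids.
    for sample in ["STOP", "停", "नमस्ते"]:
        ids = encode_utf8_bytes(sample)
        print(sample, "characters:", len(sample), "byte tokens:", len(ids) - 1)
```

Non-Latin characters expand to several byte tokens each, so sequences get longer, but no language-specific vocabulary is needed.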

Abstract

While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
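The abstract notes that OCR can be wrong on low-resolution input. One of the two robustness methods named in the TL;DR, iterative OCR conditioning, can be sketched as a loop that alternates recognition and restoration. This is a hedged illustration only: `recognize` and `super_resolve` are hypothetical callables standing in for an OCR engine and the diffusion model, the stopping rule is an assumption, and the toy stand-ins below treat an image as a string in which `?` marks an illegible character.

```python
from typing import Callable, List, Tuple


def iterative_ocr_sr(
    lq_crop: str,
    recognize: Callable[[str], str],
    super_resolve: Callable[[str, str], str],
    rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate OCR and super-resolution until the recognized text stops changing."""
    text = recognize(lq_crop)  # initial, possibly noisy, OCR on the LQ crop
    history = [text]
    sr = lq_crop
    for _ in range(rounds):
        sr = super_resolve(lq_crop, text)  # restore, conditioned on current text
        new_text = recognize(sr)           # re-read the restored crop
        history.append(new_text)
        if new_text == text:               # assumed stopping rule: text is stable
            break
        text = new_text
    return sr, history


# Toy stand-ins (not real models): the "image" is a string, '?' is illegible.
TRUTH = "CAFE"


def toy_recognize(image: str) -> str:
    return image  # the toy OCR reads exactly what is legible


def toy_super_resolve(lq: str, text: str) -> str:
    # Each pass recovers the first still-illegible character; `lq` is unused here.
    out = list(text)
    for i, ch in enumerate(out):
        if ch == "?":
            out[i] = TRUTH[i]
            break
    return "".join(out)


if __name__ == "__main__":
    restored, readings = iterative_ocr_sr("C??E", toy_recognize, toy_super_resolve)
    print(restored, readings)
```

In a real system each pass would be a full diffusion sampling run, so the number of rounds trades accuracy against compute.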

Paper Structure

This paper contains 17 sections, 6 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our single model, equipped with multilingual character-to-shape diffusion priors, can super-resolve low-resolution images in various languages and enhance both the visual quality and the legibility of text. This capability is lacking in current state-of-the-art super-resolution models (e.g., SUPIR [Yu et al., CVPR 2024], PASD [Yang et al., 2023]).
  • Figure 2: MLLMs were unable to accurately describe text content. Our method takes OCR detection results and the recognized text as input, whereas SUPIR [Yu et al., CVPR 2024] takes the full image and the following prompt generated by LLaVA [Liu et al., CVPR 2024]: "The image features a wooden table with a collection of books and pamphlets placed on top of it. There are three books in total, with one book being larger and positioned in the center of the table, while the other two books are smaller and located on the left and right sides of the table."
  • Figure 3: Model architecture and training pipeline.
  • Figure 4: Qualitative results showing the learned multilingual character-to-shape priors (from top to bottom: English, Chinese, Japanese, and Hindi). The visualization was generated by setting $\omega$ to 10.0 for the $\langle$LQ, text$\rangle$-conditioned model and 15.0 for the text-only model. For the text-only model, $c_I$ was set to $\varnothing$, and three random seeds were used. The HQ images were degraded by blurring, adding noise, and downsampling to create the three LQ images for each language group.
  • Figure 5: Qualitative results on TextVQA. State-of-the-art methods were ineffective in the presence of foreign-language text.
  • ...and 6 more figures
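
The Figure 4 caption refers to a guidance weight $\omega$ and to setting the image condition $c_I$ to $\varnothing$. The sketch below shows the standard classifier-free guidance combination that these symbols suggest. The paper's exact dual-condition formula is not reproduced on this page, so the function name, the choice of which condition is dropped in the reference branch, and the toy numbers are all assumptions.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_ref: List[float], omega: float) -> List[float]:
    """Classifier-free guidance: move from the reference prediction toward the
    conditioned one, scaled by omega.

    eps_cond: noise predicted with both conditions, e.g. eps(x_t, c_I, c_T)
    eps_ref:  noise predicted with a condition dropped (assumed: c_T set to null)
    omega:    guidance weight; 0 returns eps_ref, 1 returns eps_cond
    """
    return [r + omega * (c - r) for c, r in zip(eps_cond, eps_ref)]


if __name__ == "__main__":
    eps_with_text = [0.25, -0.5]    # toy values, not model outputs
    eps_without_text = [0.0, -0.25]
    for omega in (0.0, 1.0, 10.0):
        print(omega, guided_noise(eps_with_text, eps_without_text, omega))
```

A weight above 1, such as the 10.0 used for Figure 4, extrapolates past the conditioned prediction and so amplifies the influence of the dropped condition on the sample.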