ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

Chen Duan, Pei Fu, Shan Guo, Qianyi Jiang, Xiaoming Wei

TL;DR

ODM tackles the OCR alignment problem by introducing OCR-Text Destylization Modeling, which uses pixel-level reconstruction guided by text prompts to align OCR-Text with image features. A Text-Controller module regulates decoding so that it focuses on OCR-Text, and a new label-generation approach enables weakly supervised pre-training with unlabeled data. The training objective combines a segmentation-like loss, an OCR perceptual loss, and a batch-level contrastive loss that maps text and image into a shared space: $L_{total}=\alpha L_{seg}+\beta L_{ocr}+\gamma L_{bc}$ with $(\alpha,\beta,\gamma)=(1,1,0.5)$. Experiments that pre-train on SynthText and fine-tune on ICDAR15, CTW1500, TotalText, and LSVT show consistent improvements over existing pre-training methods for both scene text detection and spotting. Weakly supervised pre-training also yields gains, and ablation studies support each contribution.
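The weighted objective above can be illustrated with a minimal Python sketch. It only shows how the three terms are combined with the stated weights $(\alpha,\beta,\gamma)=(1,1,0.5)$; the names `LossWeights` and `total_loss` are hypothetical, the individual loss values are made-up placeholders, and nothing here is taken from the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights (alpha, beta, gamma) as stated in the summary; class name is hypothetical."""
    alpha: float = 1.0  # weight on the segmentation-like loss L_seg
    beta: float = 1.0   # weight on the OCR perceptual loss L_ocr
    gamma: float = 0.5  # weight on the batch-level contrastive loss L_bc


def total_loss(l_seg, l_ocr, l_bc, weights=LossWeights()):
    """Return alpha * L_seg + beta * L_ocr + gamma * L_bc."""
    return weights.alpha * l_seg + weights.beta * l_ocr + weights.gamma * l_bc


if __name__ == "__main__":
    # Placeholder per-batch loss values, chosen only for illustration.
    print(total_loss(l_seg=0.5, l_ocr=0.25, l_bc=0.5))
```

With these weights the contrastive term contributes half as much per unit of loss as the other two terms, so in this sketch a contrastive loss of 0.5 adds 0.25 to the total.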

Abstract

In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish from the text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM.

Paper Structure (16 sections, 4 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparison of different pre-training strategies. (a) The pre-trained model is obtained through masked image modeling, taking only image embeddings as input. (b) The pre-trained model is obtained through masked language modeling, which takes both OCR-Text and the image as input. (c) Our approach obtains the pre-trained model through OCR-Text destylization modeling.
  • Figure 2: The upper row and lower row show the original images and their corresponding destylized labels, respectively. (a), (b), (c), and (d) are taken from the ICDAR15 [karatzas2015icdar], CTW1500 [liu2019curved], TotalText [ch2020total], and LSVT [sun2019icdar] datasets, respectively.
  • Figure 3: The overall architecture of ODM. The text is encoded by the Text-Controller to obtain text features, and the image is encoded by the image encoder to obtain image features. The text features and image features interact through cross-attention, and the model finally outputs a destylized binary image.
  • Figure 4: Illustration of the proposed Text-Controller module, showing the attention heatmap (from the cross-attention layer) of the text branch under different input scenarios. (a) The original image. (b) The text input consists of three instances: "Rootin", "Ridge", and "Toymakers". (c) The "Toymakers" instance is discarded. (d) A non-existent instance "Sjehf" is added.
  • Figure 5: The upper row and lower row represent the original images and their corresponding predicted results, respectively.