ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting
Chen Duan, Pei Fu, Shan Guo, Qianyi Jiang, Xiaoming Wei
TL;DR
ODM tackles the OCR alignment problem by introducing OCR-Text Destylization Modeling, which uses pixel-level reconstruction guided by text prompts to align OCR-Text with image features. A Text-Controller module regulates decoding so that it focuses on OCR-Text, and a novel label-generation approach enables weakly supervised pre-training with unlabeled data. The training objective combines a segmentation-like loss, an OCR perceptual loss, and a batch-level contrastive loss to map text and image into a shared space: $L_{total}=\alpha L_{seg}+\beta L_{ocr}+\gamma L_{bc}$ with $(\alpha,\beta,\gamma)=(1,1,0.5)$. Experiments that pre-train on SynthText and fine-tune on ICDAR15, CTW1500, TotalText, and LSVT show consistent improvements over existing pre-training methods for both scene text detection and spotting. The weakly supervised setting also yields gains, and ablations support each contribution.
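The weighted objective above can be illustrated with a minimal sketch. Only the formula and the weights $(1, 1, 0.5)$ come from the text; the names `LossWeights` and `total_loss` are hypothetical, and the three loss terms are assumed to be already computed as scalars. How each term is computed is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights from the TL;DR: (alpha, beta, gamma) = (1, 1, 0.5)."""
    alpha: float = 1.0  # weight on the segmentation-like loss L_seg
    beta: float = 1.0   # weight on the OCR perceptual loss L_ocr
    gamma: float = 0.5  # weight on the batch-level contrastive loss L_bc


def total_loss(l_seg: float, l_ocr: float, l_bc: float,
               w: LossWeights = LossWeights()) -> float:
    """Return L_total = alpha * L_seg + beta * L_ocr + gamma * L_bc.

    The inputs are assumed to be precomputed scalar loss values; this
    helper is illustrative and not taken from the authors' code.
    """
    return w.alpha * l_seg + w.beta * l_ocr + w.gamma * l_bc


if __name__ == "__main__":
    # Example with made-up loss values, purely to show the weighting.
    print(total_loss(l_seg=0.5, l_ocr=0.25, l_bc=1.0))
```

With these weights the contrastive term contributes half as much per unit of loss as the two reconstruction-oriented terms.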
Abstract
In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images is challenging, because it requires effective alignment between text and OCR-Text (we refer to text in images as OCR-Text to distinguish it from natural-language text) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers the diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles encountered in scene text detection and spotting tasks. Additionally, we design a new label generation method specifically for ODM and combine it with our proposed Text-Controller module to reduce annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM.
