Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

Boqiang Zhang, Hongtao Xie, Zuan Gao, Yuxin Wang

TL;DR

DARLING tackles the problem of entangled representations in scene text by disentangling style and content into separate features, using synthetic image pairs with identical style but different content. The framework employs a decoupling block and a multi-task decoder that jointly supports scene text recognition (STR), editing (STE), and removal (STRM). Training is guided by an alignment loss on style features and a recognition loss on content features, and a gated injection mechanism fuses information for generation. The key contributions are (1) a disentangled pre-training paradigm, (2) a decoupled feature architecture, (3) a unified multi-task decoder, and (4) a synthetic paired dataset for robust evaluation, with state-of-the-art results reported across STR, STE, and STRM. This approach enables more adaptable, high-quality scene text processing and lays a foundation for applying style-content disentanglement to related vision tasks.

Abstract

Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) that disentangles these two types of features so that each downstream task can use the features it needs (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the supervision design. Concretely, we directly split the visual representation into style and content features: the content features are supervised by a text recognition loss, while an alignment loss aligns the style features of the two images in each pair. The style features are then used to reconstruct the counterpart image via an image decoder, with a prompt that indicates the counterpart's content. This operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first work in the scene text field to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
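
The supervision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the channel-wise split, the mean-squared-error form of the alignment loss, the single-character cross-entropy, and every name here (`split_features`, `alignment_loss`, `recognition_loss`, `pair_loss`, `style_dim`, `weight`) are assumptions made for illustration. The reconstruction of the counterpart image is left out.

```python
import math

def split_features(feature, style_dim):
    # Hypothetical split: the first `style_dim` channels are treated as style,
    # the remaining channels as content.
    return feature[:style_dim], feature[style_dim:]

def alignment_loss(style_a, style_b):
    # Assumed form: mean squared error between the two style vectors.
    return sum((a - b) ** 2 for a, b in zip(style_a, style_b)) / len(style_a)

def recognition_loss(logits, target):
    # Cross-entropy for a single character, using a stable log-sum-exp.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def pair_loss(feat_a, feat_b, style_dim, classify, target_a, target_b, weight=1.0):
    # Both images share a style, so their style parts are pulled together,
    # while each content part is supervised by its own text label.
    style_a, content_a = split_features(feat_a, style_dim)
    style_b, content_b = split_features(feat_b, style_dim)
    rec = (recognition_loss(classify(content_a), target_a)
           + recognition_loss(classify(content_b), target_b))
    return rec + weight * alignment_loss(style_a, style_b)

# Toy pair: identical style channels, different content channels.
# The identity "classifier" simply reads the content channels as logits.
feat_a = [0.5, -1.0, 2.0, 0.0]
feat_b = [0.5, -1.0, 0.0, 2.0]
print(pair_loss(feat_a, feat_b, 2, lambda c: c, 0, 1))
```

In this toy pair the style channels already match, so the alignment term is zero and only the recognition term contributes. Changing one style channel of `feat_b` makes the alignment term positive, which is the signal that would push the two style representations together during training.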

Paper Structure (19 sections, 6 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: (a) The pipeline of previous representation learning methods that use a tightly coupled feature for all tasks. 'D' means decoder and 'R/E' represents the recognizer or eraser. (b) Our decoupled representation learning framework for multi-tasking.
  • Figure 2: The pipeline and training paradigm of our DARLING. The Decoupling Block divides features from the backbone into style and content features. The multi-task decoder processes these features to perform both discriminative and generative tasks. '[p]' is the padding symbol. Image pairs with the same style but different content are used as input. The style features of each pair are aligned, and a recognition loss supervises the content features to remove style information from them.
  • Figure 3: The structure of the Multi-task Decoder. It comprises the Generative Branch (GEB) and the Discriminative Branch (DIB), each dedicated to specific tasks. A Gated Injection strategy conveys fine-grained details from DIB to GEB.
  • Figure 4: Some sample images from our generated datasets: TSE-4M and TSE-10k. The datasets comprise more diverse images with a variety of fonts and backgrounds, including low-quality images. TSE-10k can facilitate a more comprehensive evaluation of the model's performance.
  • Figure 5: Qualitative examples of text editing in real scenes. (a) Comparison of the generation quality. (b) Comparison of the ability to maintain style. (c) Comparison of the realism when the generated images are clear and readable.
  • ...and 4 more figures