CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction

Liang Zhao, Qing Guo, Xiaoguang Li, Song Wang

TL;DR

This paper introduces visual-text inpainting, a task that jointly reconstructs missing pixels in scene-text images and missing characters in the corresponding text by exploiting cross-modal cues. The proposed Cross-Modal Predictive Interaction (CLII) uses two interacting branches, ImgBranch and TxtBranch, that exchange latent embeddings through interactive multi-head attention to reinforce both image inpainting and text completion. It further demonstrates how CLII can be embedded into a state-of-the-art scene text spotting system (DeepSolo) to markedly improve robustness against image damage. Across three real datasets, CLII achieves substantial gains in image reconstruction quality and text completion accuracy, highlighting its potential to advance robust scene-text understanding in practical settings.

Abstract

Image inpainting aims to fill missing pixels in damaged images and has achieved significant progress with cutting-edge learning techniques. Nevertheless, state-of-the-art inpainting methods are mainly designed for natural images and cannot correctly recover text within scene text images, and training existing models on scene text images does not fix the issue. In this work, we identify the visual-text inpainting task to achieve high-quality scene text image restoration and text completion: given a scene text image with unknown missing regions and the corresponding text with unknown missing characters, we aim to complete the missing information in both the image and the text by leveraging their complementary information. Intuitively, the input text, even if damaged, contains language priors about the contents of the image and can guide the image inpainting. Meanwhile, the scene text image includes the appearance cues of the characters, which could benefit text recovery. To this end, we design the cross-modal predictive interaction (CLII) model containing two branches, i.e., ImgBranch and TxtBranch, for scene text inpainting and text completion, respectively, while effectively leveraging their complementary information. Moreover, we propose to embed our model into the SOTA scene text spotting method and significantly enhance its robustness against missing pixels, which demonstrates the practicality of the newly developed task. To validate the effectiveness of our method, we construct three real datasets based on existing text-related datasets, containing 1838 images and covering three scenarios with curved, incidental, and styled texts, and conduct extensive experiments to show that our method outperforms baselines significantly.
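
The exchange between ImgBranch and TxtBranch can be pictured with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes each branch holds a list of token embeddings of equal width, uses multi-head scaled dot-product attention without learned projections, and merges the attended result back with a residual sum. The function names (`cross_attention`, `interact`), the token counts, and the embedding width are hypothetical choices made for this example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values, num_heads):
    """Multi-head scaled dot-product attention (no learned projections).

    queries:     list of n_q vectors of width d (one modality)
    keys_values: list of n_kv vectors of width d (the other modality)
    Returns n_q vectors of width d.
    """
    d = len(queries[0])
    assert d % num_heads == 0, "width must be divisible by num_heads"
    head_dim = d // num_heads
    out = [[0.0] * d for _ in queries]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        for i, q in enumerate(queries):
            # Similarity of this query to every key, within head h only.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], kv[lo:hi])) / math.sqrt(head_dim)
                for kv in keys_values
            ]
            weights = softmax(scores)
            # Weighted sum of the other modality's values for head h.
            for j in range(lo, hi):
                out[i][j] = sum(w * kv[j] for w, kv in zip(weights, keys_values))
    return out


def interact(img_tokens, txt_tokens, num_heads=2):
    """Assumed interaction step: each branch attends to the other,
    then adds the result to its own embeddings (residual sum)."""
    img_from_txt = cross_attention(img_tokens, txt_tokens, num_heads)
    txt_from_img = cross_attention(txt_tokens, img_tokens, num_heads)
    new_img = [[a + b for a, b in zip(x, y)] for x, y in zip(img_tokens, img_from_txt)]
    new_txt = [[a + b for a, b in zip(x, y)] for x, y in zip(txt_tokens, txt_from_img)]
    return new_img, new_txt


rng = random.Random(0)
img_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]  # 6 image patches
txt_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 characters
new_img, new_txt = interact(img_tokens, txt_tokens)
print(len(new_img), len(new_img[0]), len(new_txt), len(new_txt[0]))  # 6 8 4 8
```

Each branch keeps its own sequence length after the exchange, so the image side can still be decoded into pixels and the text side into characters. A trained model would normally add learned query, key, and value projections, which this sketch leaves out for brevity.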

Paper Structure (27 sections, 18 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Visual-text inpainting vs. existing tasks. (a) Existing image inpainting and text completion methods, respectively. (b) The proposed visual-text inpainting. (c) Leveraging our method to enhance the SOTA text spotting method ye2023deepsolo. We highlight the key distinctions within dashed boxes.
  • Figure 2: Architecture of the proposed method. The visual text and plain text are randomly masked, respectively. "MHA" denotes the Multi-Head Attention.
  • Figure 3: Qualitative results on Total-Text, ICDAR2015, and TextSeg. From left to right, the columns are masked images, MAE he2022masked, MAE-FAR cao2022learning, MISF li2022misf, our results, and ground truth.
  • Figure 4: Performance comparison of three tasks.
  • Figure 5: Comparison of attention maps. "M" in red color is the damaged character.