FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework

Alloy Das, Sanket Biswas, Prasun Roy, Subhankar Ghosh, Umapada Pal, Michael Blumenstein, Josep Lladós, Saumik Bhattacharya

TL;DR

A fusion of target mask generation and style transfer units, combined with a cascaded self-attention mechanism, is proposed to focus on multi-level text region edits and to handle varying word lengths.

Abstract

Scene Text Editing (STE) is a challenging research problem that primarily aims to modify existing text in an image while preserving the background and the font style of the original text. Despite its utility in numerous real-world applications, existing style-transfer-based approaches have shown sub-par editing performance due to (1) complex image backgrounds, (2) diverse font attributes, and (3) varying word lengths within the text. To address these limitations, we propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations while preserving a natural and realistic appearance and structure. A fusion of target mask generation and style transfer units, combined with a cascaded self-attention mechanism, is proposed to focus on multi-level text region edits and to handle varying word lengths. Extensive evaluation on a real-world database, together with a subjective human evaluation study, indicates the superiority of FASTER in both scene text editing and rendering tasks in terms of model performance and efficiency. Our code will be released upon acceptance.

Paper Structure (28 sections, 10 equations, 18 figures, 8 tables)

Figures (18)

  • Figure 1: Given an image and the desired text to render, FASTER performs appropriate edits on the target text regions in complex real scenes with high consistency and realism on a wide range of typefaces and multiple font attributes with varying word lengths.
  • Figure 2: Overall Architecture of FASTER. In Stage-I, an approximate target style mask $\overline{m_B}$ is estimated from the source image $I_A$, source style mask $m_A$, and a fixed style mask $m_F$ of the target text. In Stage-II, the target image $\overline{I_B}$ is generated by transferring image attributes from $I_A$ and conditioning the image translation on the structural guidance $(m_A, \overline{m_B})$.
  • Figure 3: Human Evaluation Study
  • Figure 4: Visual STE results with FASTER on samples from the Real dataset, selected as the samples with the highest number of incorrect predictions in the human evaluation study. Please zoom in to 300% for better visualization.
  • Figure 5: Left: Fancy font text editing. Right: Visual comparison with TextDiffuser and DiffSTE on larger context.
  • ...and 13 more figures
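
The two-stage data flow described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration of the interfaces, not the authors' implementation: the function names are hypothetical, images are simplified to small grayscale grids, and both stages are replaced by trivial stand-ins (Stage-I simply copies the fixed mask $m_F$, and Stage-II fills pixels with mean intensities) where the paper uses learned networks.

```python
from statistics import mean
from typing import List, Tuple

Image = List[List[float]]  # grayscale image: rows of pixel intensities
Mask = List[List[int]]     # binary mask: 1 marks a text pixel


def stage1_estimate_target_mask(I_A: Image, m_A: Mask, m_F: Mask) -> Mask:
    """Stand-in for Stage-I (hypothetical name).

    In the paper, a network estimates the target style mask from I_A, m_A
    and m_F. Here we only copy m_F so that the data flow can be followed.
    """
    return [row[:] for row in m_F]


def stage2_generate_target_image(I_A: Image, m_A: Mask, m_B_hat: Mask) -> Image:
    """Stand-in for Stage-II (hypothetical name).

    In the paper, a network transfers image attributes from I_A, guided by
    (m_A, m_B_hat). Here: target-text pixels get the mean source-text
    intensity, vacated source-text pixels get the mean background
    intensity, and all other pixels are copied from I_A.
    """
    pairs = [(p, m) for img_row, m_row in zip(I_A, m_A)
             for p, m in zip(img_row, m_row)]
    text_px = [p for p, m in pairs if m]
    bg_px = [p for p, m in pairs if not m]
    text_val = mean(text_px) if text_px else 0.0
    bg_val = mean(bg_px) if bg_px else 0.0

    out: Image = []
    for img_row, src_row, tgt_row in zip(I_A, m_A, m_B_hat):
        out_row = []
        for p, src, tgt in zip(img_row, src_row, tgt_row):
            if tgt:
                out_row.append(text_val)  # draw the new text
            elif src:
                out_row.append(bg_val)    # erase the old text
            else:
                out_row.append(p)         # keep the background
        out.append(out_row)
    return out


def edit_text(I_A: Image, m_A: Mask, m_F: Mask) -> Tuple[Mask, Image]:
    """Chain the two stages: mask estimation, then image generation."""
    m_B_hat = stage1_estimate_target_mask(I_A, m_A, m_F)
    I_B_hat = stage2_generate_target_image(I_A, m_A, m_B_hat)
    return m_B_hat, I_B_hat
```

The point of the sketch is the ordering: the estimated mask from Stage-I is an explicit intermediate result, and Stage-II receives it alongside the source mask as structural guidance.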