TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen, Taiping Yao, Shouhong Ding, Jianguo Zhang

TL;DR

Problem: Scene Text Editing (STE) requires editing text in images while preserving the background; existing methods struggle with incomplete disentanglement of content, style, and background.

Approach: TripleFDS introduces explicit triple-feature disentanglement and a two-phase framework (disentanglement and synthesis) built on the SCB Synthesis dataset and SCB Groups to enable diverse, self-supervised training; it uses inter-group contrastive loss and intra-sample orthogonality, plus a feature remapping strategy during reconstruction to prevent leakage.

Contributions: a novel triple-feature disentanglement framework, the SCB Synthesis dataset, state-of-the-art results on mainstream STE benchmarks, and flexible operations like style replacement and background transfer.

Significance: enables highly controllable, high-fidelity scene text editing with robust reconstruction and better generalization to real-world images.

Abstract

Scene Text Editing (STE) aims to modify text in images naturally while preserving visual consistency, which depends on three factors: text style, text content, and background. Previous methods disentangle these editable attributes only incompletely, typically addressing a single aspect such as text content, which limits controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles the three features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and to mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Beyond superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
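
The two regularizers named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the InfoNCE-style form of the contrastive term, the squared-cosine form of the orthogonality penalty, the `temperature` value, and both function names are assumptions chosen for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def inter_group_contrastive(features, labels, temperature=0.1):
    """Assumed InfoNCE-style loss: samples sharing an attribute label
    (e.g. the same style) are treated as positives of each other."""
    z = l2_normalize(features)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    # Log-softmax over every other sample, computed stably.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denominator
    # Average the negative log-probability over each anchor's positives.
    n_pos = np.maximum(positives.sum(axis=1), 1)
    per_anchor = -(log_prob * positives).sum(axis=1) / n_pos
    return per_anchor.mean()

def intra_sample_orthogonality(style, content, background):
    """Assumed penalty: mean squared cosine similarity between the three
    feature vectors extracted from the same image."""
    s, c, b = (l2_normalize(f) for f in (style, content, background))
    cos_sc = (s * c).sum(axis=1)
    cos_sb = (s * b).sum(axis=1)
    cos_cb = (c * b).sum(axis=1)
    return np.mean(cos_sc**2 + cos_sb**2 + cos_cb**2)
```

Under these assumptions, the contrastive term is small when features with the same attribute label cluster together, and the orthogonality term is zero when the style, content, and background vectors of a sample are mutually orthogonal.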

Paper Structure

This paper contains 25 sections, 4 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: TripleFDS's capabilities. Purple lines denote additional feature permutation-based editing operations enabled by our approach.
  • Figure 2: Visualizing $3\times3\times3$ SCB Group and Feature Disentanglement.
  • Figure 3: Overview of our TripleFDS framework. This figure illustrates its core components and strategies: a minimal SCB Group (top-left), where yellow and brown lines denote the remapping objects for style and background features under the Remapping Strategy; the pipeline for feature disentanglement and synthesis (middle), where rounded rectangles represent network structures and red diagonal lines indicate learnable tokens; and visualizations of the Inter loss and Intra loss (bottom-right).
  • Figure 4: Comparison of previous methods with ours. Previous methods tend to generate incorrect or fused text, as shown in the red boxes, while TripleFDS effectively mitigates these problems.
  • Figure 5: Different editing operations of TripleFDS, with the operations highlighted in purple representing those that TripleFDS can perform in addition to the capabilities of previous methods.
  • ...and 5 more figures
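
The $3\times3\times3$ SCB Group of Figure 2 and the remapping lines of Figure 3 can be sketched as a simple enumeration. This is one plausible reading, not the paper's data pipeline: the helper names `build_scb_group` and `remap_donor`, the string labels, and the rule that a donor must differ from the target in the other two attributes are all hypothetical.

```python
from itertools import product

def build_scb_group(styles, contents, backgrounds):
    """Enumerate every (style, content, background) combination."""
    return list(product(styles, contents, backgrounds))

def remap_donor(group, target, attribute):
    """Pick another sample that shares `attribute` with `target` but
    differs in the other two attributes (assumed remapping rule)."""
    idx = {"style": 0, "content": 1, "background": 2}[attribute]
    for sample in group:
        shares = sample[idx] == target[idx]
        differs_elsewhere = all(
            sample[i] != target[i] for i in range(3) if i != idx
        )
        if shares and differs_elsewhere:
            return sample
    raise ValueError(f"no donor found for attribute {attribute!r}")

group = build_scb_group(
    ["s0", "s1", "s2"], ["c0", "c1", "c2"], ["b0", "b1", "b2"]
)
target = ("s0", "c0", "b0")
style_donor = remap_donor(group, target, "style")
background_donor = remap_donor(group, target, "background")
```

Taking the style feature from `style_donor` and the background feature from `background_donor` when reconstructing `target` means the model cannot copy the target image through a single feature branch, which is the kind of "shortcut" the abstract says feature remapping is meant to prevent.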