Towards Training-Free Scene Text Editing

Yubo Li, Xugong Qin, Peng Zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang

Abstract

Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow.

Paper Structure

This paper contains 24 sections, 15 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Comparison of the pipelines of training-based and training-free methods for scene text editing. Training-based methods require large-scale, high-quality paired data and substantial computing resources. Existing training-free methods mostly focus on attention maps for general objects, neglecting text accuracy and style consistency.
  • Figure 2: The overall framework of TextFlow. In the first phase, the source image is encoded into latent representations $\mathbf{z}_t$ and $\mathbf{z}_{src}$ via the VAE encoder; these are processed by the FMS module to produce the concatenated representations $\mathbf{z}^{src,cat}_t$ and $\mathbf{z}^{tar,cat}_t$. Together with the corresponding text embeddings $\mathbf{e}^{src}_p$ and $\mathbf{e}^{tar}_p$, they are fed into parallel DiT blocks to compute the velocity field differential $\Delta_V$, ultimately producing the edited latent representation $\mathbf{z}_{edit}$. In the second phase, $\mathbf{z}_{edit}$ and the target embedding $\mathbf{e}^{tar}_{p}$ are processed by the AttnBoost DiT (AB-DiT), where concatenation and self-attention produce refined text-to-image attention maps that enhance textual rendering accuracy through spatial-aware amplification. (A minimal code sketch of this two-phase pipeline follows the figure list.)
  • Figure 3: Illustration of the proposed FMS module. The latent representations $\mathbf{z}_t$ and $\mathbf{z}_{src}$ are combined with random noise $\epsilon$ through linear interpolation and vector arithmetic to maintain style consistency.
  • Figure 4: Qualitative analysis. The compared methods include training-based STE approaches such as DiffSTE [ji2023improving], AnyText [tuo2023anytext], and TextFlux [xie2025textflux], as well as the recent training-free editing technique FlowEdit [kulikov2024flowedit]. We also include the powerful foundation model Flux-Kontext [labs2025flux1kontextflowmatching] (F-Kontext) for a more extensive comparison.
  • Figure 5: Qualitative comparison among different DiT-based methods on a full-size image.
  • ...and 9 more figures
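
To make the data flow in Figures 2 and 3 concrete, here is a minimal, hypothetical PyTorch sketch of the two-phase pipeline. The straight-line interpolation in fms_interpolate, the $\Delta_V$ computation over parallel DiT branches, the boost gain gamma, and the renormalization step are assumptions reconstructed from the captions, not the authors' released implementation.

```python
# Hedged sketch of the two-phase TextFlow pipeline described in Figures 2-3.
# All names here (fms_interpolate, velocity_differential, attn_boost, the toy
# `dit` callable, and the gain `gamma`) are illustrative assumptions, not the
# authors' released API.
import torch

def fms_interpolate(z_src: torch.Tensor, eps: torch.Tensor, t: float) -> torch.Tensor:
    """Flow Manifold Steering, per Figure 3 (assumed form): place the noisy
    latent on the straight rectified-flow path between the clean source
    latent z_src and Gaussian noise eps."""
    return (1.0 - t) * z_src + t * eps

def velocity_differential(dit, z_src_cat, z_tar_cat, e_src, e_tar, t):
    """Run the frozen DiT on the source and target branches in parallel and
    take the difference of the predicted velocity fields (Delta_V in Fig. 2)."""
    v_src = dit(z_src_cat, e_src, t)
    v_tar = dit(z_tar_cat, e_tar, t)
    return v_tar - v_src

def attn_boost(attn_map: torch.Tensor, text_mask: torch.Tensor, gamma: float = 1.5) -> torch.Tensor:
    """AttnBoost, phase two (assumed form): amplify text-to-image attention
    inside the edited text region to sharpen glyph rendering, then
    renormalize so each query still sums to one."""
    boosted = torch.where(text_mask, attn_map * gamma, attn_map)
    return boosted / boosted.sum(dim=-1, keepdim=True)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for VAE latents and a DiT.
    z_src = torch.randn(1, 16, 32, 32)           # source latent from the VAE encoder
    eps = torch.randn_like(z_src)                # Gaussian noise
    z_t = fms_interpolate(z_src, eps, t=0.5)     # noisy latent on the flow path

    dit = lambda z, e, t: torch.randn_like(z)    # placeholder velocity predictor
    e_src = e_tar = torch.randn(1, 77, 768)      # placeholder text embeddings
    delta_v = velocity_differential(dit, z_t, z_t, e_src, e_tar, t=0.5)

    # One Euler step of the edit: move the latent along the velocity differential.
    z_edit = z_t + 0.05 * delta_v

    # Phase two: boost attention over a (pretend) text region, then renormalize.
    attn = torch.softmax(torch.randn(1, 8, 64, 64), dim=-1)
    mask = torch.zeros(1, 8, 64, 64, dtype=torch.bool)
    mask[..., :16] = True
    attn = attn_boost(attn, mask)
```

In the actual framework, dit would be a frozen pretrained flow-matching DiT and the mask would come from the target text layout; the placeholders above only illustrate the tensor shapes and the direction of each step.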