OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh, Munchurl Kim

TL;DR

OmniText performs text manipulation within images without additional training by exploiting properties of cross- and self-attention in a text-oriented diffusion backbone. It introduces Text Removal, via Self-Attention Inversion and Cross-Attention Reassignment, and Controllable Inpainting, guided by a Grid Trick and by latent optimization with a Cross-Attention Content Loss and a Self-Attention Style Loss, which together give control over both content and style. A new benchmark, OmniText-Bench, supports evaluation across five TIM tasks, including style-based insertion and editing. Empirical results show that OmniText outperforms other text inpainting baselines across multiple tasks and metrics and is comparable with specialist methods.

Abstract

Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
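
To make the two attention manipulations named in the abstract more concrete, the following toy sketch works on a single attention row in plain Python. It is only an illustration under stated assumptions, not the paper's formulation: the abstract does not give the exact operations, so "inversion" is modeled here as negating the attention logits before the softmax, and "reassignment" as moving a fixed fraction of probability mass onto one chosen token. The function names, the `boost` parameter, and the example values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invert_self_attention(logits):
    """Toy inversion (assumption): negate the logits before the softmax,
    so the keys that were attended to most now receive the least weight."""
    return softmax([-x for x in logits])


def reassign_cross_attention(probs, target_idx, boost):
    """Toy reassignment (assumption): take a fraction `boost` of the mass
    of every non-target token and give it to the target token.
    The result still sums to 1."""
    moved = [p * boost for p in probs]
    out = [p - mv for p, mv in zip(probs, moved)]
    out[target_idx] = probs[target_idx] + (sum(moved) - moved[target_idx])
    return out


# Hypothetical self-attention logits of one masked query toward three keys:
# index 0 = surrounding text, index 1 = text edge, index 2 = plain background.
self_logits = [2.0, 0.5, -1.0]
original = softmax(self_logits)
inverted = invert_self_attention(self_logits)

# Hypothetical cross-attention probabilities over three prompt tokens;
# half of the mass of the other tokens is moved onto token 0.
cross_probs = [0.1, 0.6, 0.3]
reassigned = reassign_cross_attention(cross_probs, target_idx=0, boost=0.5)

print("original self-attention :", original)
print("inverted self-attention :", inverted)
print("reassigned cross-attn   :", reassigned)
```

In this toy setting the inverted row places its largest weight on the background key rather than on the surrounding text, which mirrors the abstract's description of reducing the model's focus on surrounding text. The reassigned row raises the probability of one selected token while remaining a valid distribution.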

Paper Structure

This paper contains 40 sections, 10 equations, 24 figures, 18 tables, 1 algorithm.

Figures (24)

  • Figure 1: Various text image manipulation applications on our OmniText-Bench using our proposed OmniText. OmniText is a training-free generalist that can control both text content and styles during text rendering. Our OmniText allows for a wide range of applications including additional tasks (e.g., style-based insertion) that existing methods cannot achieve.
  • Figure 2: TextDiff-2 limitations.
  • Figure 3: Key attention properties for controllable inpainting and removal at sampling step $t=751$ and at Decoder Block 2, Layer 0. $f$ stands for cross-attention manipulation.
  • Figure 4: Overview of OmniText for text editing. First, we perform Text Removal (TR) by modulating attention with our proposed CAR and SAI during sampling. Then, we apply Controllable Inpainting (CI) using a latent optimization strategy with content loss $\mathcal{L}_{C}$ and style loss $\mathcal{L}_S$ to control content and style, respectively.
  • Figure 5: Qualitative comparison on standard benchmark. We present visual comparisons for both all text and largest text settings, with some baselines excluded due to their poor performance. A complete comparison is provided in the Appendix.
  • ...and 19 more figures