AnyText2: Visual Text Generation and Editing With Customizable Attributes

Yuxiang Tuo, Yifeng Geng, Liefeng Bo

TL;DR

AnyText2 tackles the challenge of precise multilingual text rendering and per-line attribute control in natural scene image generation. It introduces WriteNet+AttnX to decouple text writing from image synthesis and a Text Embedding Module with glyph, position, font, and color encoders to condition text attributes, enabling both embedded and overlaid text. The approach yields state-of-the-art text accuracy, improved image realism, and a 19.8% inference speedup, validated on a large multilingual dataset with long captions to enhance prompt-following. The work enables practical applications like logo and poster design, with open-source code provided for broader adoption.

Abstract

As the text-to-image (T2I) domain progresses, generating text that seamlessly integrates with visual content has garnered significant attention. However, even with accurate text generation, the inability to control font and color can greatly limit certain applications, and this issue remains insufficiently addressed. This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing. Our approach consists of two main components. First, we propose a WriteNet+AttnX architecture that injects text rendering capabilities into a pre-trained T2I model. Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed. Second, we explore techniques for extracting fonts and colors from scene images and develop a Text Embedding Module that encodes these text attributes separately as conditions. As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be made open-source at https://github.com/tyxsspa/AnyText2.
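
To make the idea of per-line attribute customization concrete, here is a minimal Python sketch of how the four conditions named in the TL;DR (glyph, position, font, color) might be grouped for each line of text. Everything in it is an assumption for illustration: `TextLine`, `build_conditions`, the box format, and the font path are hypothetical names and values, not part of the AnyText2 code or its API, and the real Text Embedding Module encodes these attributes with learned encoders rather than passing them through as plain values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TextLine:
    """One line of text to render, with optional attributes (hypothetical type)."""
    text: str                                      # characters to draw (glyph condition)
    box: Tuple[int, int, int, int]                 # position as (x0, y0, x1, y1) in pixels
    font: Optional[str] = None                     # e.g. a font file path; None = unspecified
    color: Optional[Tuple[int, int, int]] = None   # RGB; None = unspecified


def build_conditions(lines: List[TextLine]) -> List[Dict[str, object]]:
    """Collect one record per line holding the four attribute conditions."""
    conditions = []
    for line in lines:
        x0, y0, x1, y1 = line.box
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"empty box for line {line.text!r}")
        conditions.append({
            "glyph": line.text,
            "position": line.box,
            "font": line.font,     # stays None when the user gives no font
            "color": line.color,   # stays None when the user gives no color
        })
    return conditions


# Two lines with different attributes: the first fixes font and color,
# the second leaves both unspecified.
lines = [
    TextLine("OPEN", (40, 30, 200, 80), font="fonts/Example-Bold.ttf", color=(255, 0, 0)),
    TextLine("营业中", (40, 100, 200, 150)),
]
conds = build_conditions(lines)
```

Keeping each attribute as a separate field mirrors the abstract's statement that the attributes are encoded separately as conditions, so one line can fix its color while another leaves it open.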

Paper Structure

This paper contains 23 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: AnyText2 can accurately generate multilingual text within images and achieve a realistic integration. Furthermore, it allows for customized attributes for each line, such as controlling the font style through font files, mimicking an image using a brush tool, and specifying the text color. Additionally, AnyText2 enables customizable attribute editing of text within images.
  • Figure 2: The framework of AnyText2. A WriteNet+AttnX architecture integrates text generation capability into pre-trained diffusion models, and a Text Embedding Module provides various conditional controls for text generation.
  • Figure 3: Increasing the strength coefficient $\alpha$ from 0 to 1 gradually improves the text-image fusion.
  • Figure 4: Examples of customizing text attributes. The first row demonstrates font style control using a user-specified font file. The second row showcases selecting a text region from an image to mimic its font style. The third row illustrates the control of text color.
  • Figure 5: Qualitative comparison of AnyText2 and other methods. AnyText2 demonstrates significant advantages in text accuracy, text-image integration, attribute customization, and multilingual support.
  • ...and 3 more figures
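
The strength coefficient $\alpha$ in the Figure 3 caption can be pictured with a small Python sketch. The caption states only that $\alpha$ ranges from 0 to 1 and that fusion improves as it grows. The residual form used here (base features plus $\alpha$ times a text-related branch output) is an assumption for illustration, and `blend` and `toy_branch` are hypothetical names, not functions from the AnyText2 code.

```python
from typing import Callable, List

Vector = List[float]


def blend(hidden: Vector, text_branch: Callable[[Vector], Vector], alpha: float) -> Vector:
    """Add the branch output to the base features, scaled by alpha (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    branch_out = text_branch(hidden)
    return [h + alpha * b for h, b in zip(hidden, branch_out)]


def toy_branch(hidden: Vector) -> Vector:
    """Stand-in for a learned layer: halves each feature (illustration only)."""
    return [0.5 * h for h in hidden]


hidden = [1.0, -2.0, 4.0]
off = blend(hidden, toy_branch, 0.0)   # alpha = 0: base features pass through unchanged
full = blend(hidden, toy_branch, 1.0)  # alpha = 1: full branch contribution is added
```

Under this assumed form, $\alpha = 0$ leaves the base features untouched and $\alpha = 1$ applies the full contribution, which matches the caption's description of a gradual change between the two ends.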