Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation

Minxing Luo, Zixun Xia, Liaojun Chen, Zhenhang Li, Weichao Zeng, Jianye Wang, Wentao Cheng, Yaxing Wang, Yu Zhou, Jian Yang

TL;DR

This work tackles the challenge of generating accurate visual text within complex layouts using diffusion models. It introduces STGen, a training-free dual-branch framework comprising a Semantic Rectification Branch and a Structure Injection Branch, which jointly refine text region semantics and glyph structure in the latent space. Key techniques include a Reference Branch for semantic priors, AdaIN-based latent merging, and a Divide and Conquer strategy to handle multi-part layouts, all without retraining. Empirical results on a benchmark derived from AnyText show STGen consistently improves OCR accuracy, text-image harmony, and user preference across English and Chinese, demonstrating practical impact for real-world visual text generation tasks.

Abstract

In real-world images, slanted or curved text, especially on cans, banners, or badges, appears as frequently as flat text, if not more so, owing to artistic design or layout constraints. Although the generative capabilities of diffusion models have made high-quality visual text generation possible, these models often produce distorted text and inharmonious text backgrounds when given slanted or curved text layouts, because of limitations in their training data. In this paper, we introduce a new training-free framework, STGen, which accurately generates visual text in challenging scenarios (e.g., slanted or curved text layouts) while harmonizing it with the text background. Our framework decomposes the visual text generation process into two branches: (i) the Semantic Rectification Branch, which leverages the model's ability to generate flat but accurate visual text to guide generation in challenging scenarios. The generated latent of the flat text is rich in accurate semantic information about both the text itself and its background. By incorporating this latent, we rectify the semantic information of the text and harmonize its integration with the background in complex layouts. (ii) the Structure Injection Branch, which reinforces the visual text structure during inference. We incorporate the latent information of the glyph image, which is rich in glyph structure, as a new condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge the priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts demonstrate that our framework achieves superior accuracy and outstanding quality.
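
The TL;DR names AdaIN-based latent merging as the way the priors are combined. The sketch below illustrates only the standard AdaIN operation (re-normalizing one feature map to the per-channel mean and standard deviation of another) followed by a masked blend. The function names `adain` and `merge_priors`, the `(C, H, W)` latent shape, the binary `text_mask`, and the choice of which prior supplies the statistics are all assumptions made for illustration; the paper's actual merging rule may differ.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Standard AdaIN: give `content` the per-channel statistics of `style`.

    Both inputs are assumed to be latents of shape (C, H, W).
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize content per channel, then rescale and shift to style statistics.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean


def merge_priors(semantic_prior, structural_prior, text_mask):
    """Hypothetical merge: align the structural prior to the semantic prior's
    statistics, then keep it only inside the text region.

    `text_mask` is assumed to be 1 inside the text region and 0 elsewhere,
    with a shape broadcastable to (C, H, W).
    """
    aligned = adain(structural_prior, semantic_prior)
    return text_mask * aligned + (1.0 - text_mask) * semantic_prior
```

In this sketch the structural latent keeps its spatial pattern while its channel statistics are matched to the semantic latent, which is one plausible way such a merge could avoid a visible seam between the text region and its background.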

Paper Structure (23 sections, 7 equations, 7 figures, 5 tables)

This paper contains 23 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: STGen for visual text generation in challenging layouts. Using a pre-trained visual text generation model (e.g., AnyText (Tuo et al., 2023)), our method, STGen, guides the model to adjust the text region in latent space during image synthesis, producing images that more faithfully represent the input prompt with precise visual text.
  • Figure 2: Failure cases of AnyText (Tuo et al., 2023). The top row illustrates two failure cases: textual distortion (left) and background occlusion (right). The bottom row displays results using our method.
  • Figure 3: Comparison of predicted $x_0$ at different inference steps. The first and second rows show AnyText's intermediate predicted $x_0$ for flat and slanted masks, respectively. While predictions remain stable under flat masks, the $x_0$ prediction drifts during inference with slanted masks. Our method effectively guides the model to maintain accuracy in visual text.
  • Figure 4: Pipeline of our method. Given the user input on the leftmost side, which contains a prompt and a mask $l_p$ specifying the positions for generating visual text, we first split $l_p$ using the Divide and Conquer strategy and obtain the glyph image $l_g$ and the flat position mask $\tilde{l}_p$. Then $\tilde{l}_p$ and $l_g$ are input to the Semantic Rectification Branch and the Structure Injection Branch, respectively. In the Semantic Rectification Branch, based on $\tilde{l}_p$, we render the flat glyph $\tilde{l}_g$ and use both, along with the prompt and random noise $z_T$, to generate the latent with flat visual text. This latent serves as a semantic prior, providing rich semantic information for generating both the text and its background. $l_g$, on the other hand, is converted into the latent space as a structural prior for structural refinement of the text. Finally, the two priors are combined to guide the generation of the visual text in $l_p$.
  • Figure 5: Qualitative comparison of our method and state-of-the-art models in both English and Chinese text generation.
  • ...and 2 more figures