GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Zexuan Yan; Jiarui Jin; Yue Ma; Shijian Wang; Jiahui Hu; Wenxiang Jiao; Yuan Lu; Linfeng Zhang

Abstract

Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

Paper Structure (84 sections, 10 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Related work
Preliminaries
Multimodal Diffusion Transformer
Methods
Extraction Stage
Draft Preview Stage
Glyph Injection Stage
Frequency Decomposition.
Injection with Attention Enhancement.
Style Refinement Stage
Iterative Refinement.
Benchmark and Evaluation Protocols
Benchmark
Evaluation Protocols
...and 69 more sections

Figures (13)

Figure 1: Gallery of various text rendering results sampled by GlyphBanana.
Figure 2: Illustration of the motivation. In-distribution cases show a satisfying balance between precision and style, whereas a large gap remains between out-of-distribution (OOD) cases and deterministically rendered text.
Figure 3: Overview of the GlyphBanana agentic pipeline. The workflow comprises four stages: (1) Extraction Stage parses the input into text content and style attributes; (2) Draft Preview Stage generates an initial image via a Layout Planner; (3) Glyph Injection Stage applies Frequency Decomposition in latent space and Attention Re-weighting inside each DiT block; (4) Style Refinement Stage employs iterative refinement with a Style Refiner and Score Judger. The bottom panel details the denoising process with the Attention Re-weighting.
Figure 4: Illustration of the GlyphBanana-Benchmark with auxiliary tools. The proposed benchmark consists of two categories. General Text for Rendering assesses standard and stylized text rendering. Formulas from Easy to Complex evaluates formula rendering across varying complexities.
Figure 5: Qualitative comparisons with other baselines. "Fail" indicates that FLUX.1-dev-based models cannot follow instructions to render Chinese text because of their limited text encoder. The quoted target text to be rendered is colored red, and the glyph-related style text is colored blue.
...and 8 more figures
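
The Frequency Decomposition step named in the Figure 3 caption can be illustrated with a minimal sketch. The caption does not say how the latent is split or recombined, so everything below is an assumption rather than the authors' method: the Fourier-domain radial mask, the hypothetical `cutoff` parameter, the function names, and in particular the choice to take low frequencies from the glyph-template latent and high frequencies from the draft latent.

```python
import numpy as np


def low_pass_mask(height, width, cutoff):
    """Boolean mask that keeps frequencies whose radius is <= cutoff.

    Frequencies are in cycles per sample, as returned by np.fft.fftfreq,
    so the largest possible radius is sqrt(0.5).
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) <= cutoff


def blend_latents(draft, glyph, cutoff=0.1):
    """Mix two latents of shape (..., H, W) in the frequency domain.

    ASSUMPTION: low frequencies come from `glyph`, high frequencies from
    `draft`. The source does not specify which band comes from which input.
    """
    mask = low_pass_mask(draft.shape[-2], draft.shape[-1], cutoff)
    draft_f = np.fft.fft2(draft, axes=(-2, -1))
    glyph_f = np.fft.fft2(glyph, axes=(-2, -1))
    # The mask broadcasts over any leading (channel or batch) dimensions.
    mixed = np.where(mask, glyph_f, draft_f)
    # The mask is symmetric under frequency negation, so for real inputs
    # the result is real up to floating-point error.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real
```

Under these assumptions, a small `cutoff` transfers only the coarse layout of the glyph template, while a larger one also overwrites finer detail of the draft. In a real pipeline the blend would be applied to diffusion latents at selected denoising steps, which this sketch does not model.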
