Expressive Text-to-Image Generation with Rich Text

Songwei Ge; Taesung Park; Jun-Yan Zhu; Jia-Bin Huang

TL;DR

This work introduces rich-text-to-image generation, which gives fine-grained, region-wise control over text-to-image synthesis by using the formatting attributes of a rich-text editor (color, style, texture, footnotes, embedded images). It uses a two-stage region-based diffusion framework: the first stage derives word-to-region layouts from the attention maps of a plain-text generation, and the second applies region-specific prompts, guidance, and injections to render local attributes while preserving the overall structure. A new rich-text benchmark assesses precise color rendering, local style control, and complex prompt alignment, and the quantitative and qualitative results show improvements over baselines. The approach is compatible with multiple diffusion models (e.g., Stable Diffusion 1.5 and SDXL) and can be extended to editing real images.

Abstract

Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.
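The region-based step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each region has a token map over pixels and a noise prediction obtained with its region-specific prompt, and that the noised plain-text generation is kept wherever tokens are unformatted. The function names `compose_region_noise` and `blend_with_plain`, the array shapes, and the per-pixel normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def compose_region_noise(token_maps, region_noises):
    """Combine per-region noise predictions, weighted by token maps.

    token_maps:    (R, H, W) non-negative weights, one map per region.
    region_noises: (R, C, H, W) noise predicted with each region's prompt.
    Returns a (C, H, W) array.
    """
    token_maps = np.asarray(token_maps, dtype=np.float64)
    region_noises = np.asarray(region_noises, dtype=np.float64)
    # Assumption: normalize so the weights at every pixel sum to 1.
    total = token_maps.sum(axis=0, keepdims=True)
    weights = token_maps / np.clip(total, 1e-8, None)
    # Broadcast the weights over the channel axis, then sum over regions.
    return (weights[:, None, :, :] * region_noises).sum(axis=0)


def blend_with_plain(x_rich, x_plain, unformatted_mask):
    """Keep the plain-text sample where the mask is 1, the rich-text sample elsewhere.

    x_rich, x_plain:  (C, H, W) noised samples at the same timestep.
    unformatted_mask: (H, W) mask, 1 for pixels of unformatted tokens.
    """
    mask = np.asarray(unformatted_mask, dtype=np.float64)[None, :, :]
    return mask * np.asarray(x_plain) + (1.0 - mask) * np.asarray(x_rich)
```

In a full sampler these two functions would be called once per denoising step, after the diffusion model has produced one noise prediction per region; the guidance gradient used for attributes such as color is left out of this sketch.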

Paper Structure (16 sections, 11 equations, 32 figures, 5 tables)

Introduction
Related Work
Rich Text to Image Generation
Problem Setting
Method
Experimental Results
Experimental Setups
Rich-Text Benchmark
Quantitative Comparison
Visual Comparison
Ablation Study
Discussion and Limitations
Additional Results
Additional Details
Rich-text Benchmark
...and 1 more section

Figures (32)

Figure 1: Plain text (left image) vs. rich text (right image). Our method allows a user to describe an image using a rich-text editor that supports various text attributes such as font family, size, color, and footnote. Given these text attributes extracted from rich-text prompts, our method enables more precise control of text-to-image synthesis over colors, styles, and object details than plain text does.
Figure 2: Rich-text-to-image framework. First, the plain-text prompt is processed by a diffusion model to collect self- and cross-attention maps, noised generation, and residual feature maps at certain steps. The token maps of the input prompt are constructed by first creating a segmentation using the self-attention maps and then labeling each segment using the cross-attention maps. Then the rich texts are processed as JSON to provide attributes for each token span. The resulting token maps and attributes are used to guide our region-based control. We inject the self-attention maps, noised generation, and feature maps to improve fidelity to the plain-text generation.
Figure 3: Token map creation. We average the collected self- and cross-attention maps to create token maps that indicate the layout of the input prompt. The segmentation is first constructed by spectral clustering using the self-attention maps. Then, the averaged cross-attention maps are adopted to label each segment using annotated tokens.
Figure 4: Region-based diffusion. We fulfill the guidance specified by the rich-text attributes through separate diffusion processes. Depending on the functionality, the attributes are interpreted either as a region-based guidance target (e.g., re-coloring the church) or as a textual input to the diffusion UNet (e.g., handling the embedded image describing the snowy mountain). The self-attention maps and feature maps extracted from the plain-text generation process are injected to help preserve the structure. The predicted noise $\epsilon_{t,\boldsymbol{e}_i}$, weighted by the token map, and the guidance gradient $\frac{\partial L}{\partial \mathbf{x}_t}$ are used to denoise and update the previous generation $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$. The noised plain-text generation $\mathbf{x}^\text{plain}_t$ is blended with the current generation to preserve the exact content in the regions of the unformatted tokens.
Figure 5: Qualitative comparison on precise color generation. We show images generated by Prompt-to-Prompt (Hertz et al., 2022), InstructPix2Pix (Brooks et al., 2022), and our method using prompts with font colors. Our method generates precise colors according to either color names or RGB values. Both baselines generate plausible but inaccurate colors given color names, while neither understands the color defined by RGB values. InstructPix2Pix tends to apply the color globally, even outside the target object.
...and 27 more figures
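
The labeling step in the Figure 3 caption (segments from self-attention, labels from averaged cross-attention) can be sketched as follows. This is a minimal illustration that takes the segmentation as given rather than running spectral clustering. The function name `label_segments`, the min-max normalization, and the `threshold` value are assumptions made for this example, not details confirmed by the captions.

```python
import numpy as np


def label_segments(segments, cross_attn, threshold=0.3):
    """Assign segments to tokens using averaged cross-attention.

    segments:   (H, W) integer segment ids (e.g. from clustering self-attention).
    cross_attn: (T, H, W) averaged cross-attention, one map per annotated token.
    Returns (T, H, W) binary token maps.
    """
    segments = np.asarray(segments)
    cross_attn = np.asarray(cross_attn, dtype=np.float64)
    seg_ids = np.unique(segments)
    # Mean attention of each token inside each segment: shape (T, S).
    scores = np.stack(
        [cross_attn[:, segments == s].mean(axis=1) for s in seg_ids], axis=1
    )
    # Assumption: min-max normalize each token's scores across segments.
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    norm = (scores - lo) / np.clip(hi - lo, 1e-8, None)
    token_maps = np.zeros(cross_attn.shape, dtype=np.float64)
    for t in range(cross_attn.shape[0]):
        for j, s in enumerate(seg_ids):
            # A segment joins a token's map when its normalized score is high enough.
            if norm[t, j] > threshold:
                token_maps[t][segments == s] = 1.0
    return token_maps
```

With a thresholding rule like this one, a segment may be assigned to several tokens or to none, so the resulting maps can overlap or leave pixels uncovered.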
