AMO Sampler: Enhancing Text Rendering with Overshooting

Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei

TL;DR

AMO introduces a training-free overshooting sampler for Rectified Flow (RF) models to improve text rendering in text-to-image (T2I) generation. By alternating over-simulation of the learned ODE with noise injection and integrating an attention-guided per-patch modulation, AMO effectively implements a Langevin dynamics correction that reduces text misspellings without increasing inference cost. Empirical results show substantial gains in text rendering accuracy (32.3% on SD3 and 35.9% on Flux) and improvements on human and OCR-based evaluations across multiple RF-based T2I models, while maintaining or improving overall image quality. The approach is lightweight, model-agnostic, and readily adoptable for existing RF-based T2I pipelines, offering a practical path to sharper, more faithful text in generated images.

Abstract

Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. State-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models that alternates between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts in the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to its attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux, respectively, without compromising overall image quality or increasing inference cost. Code available at: https://github.com/hxixixh/amo-release.
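
The alternation described in the abstract (over-simulate the ODE, then reintroduce noise) can be sketched as follows. This is a minimal illustration, not the released implementation. It assumes the interpolation ${\boldsymbol{X}}_t = t{\boldsymbol{X}}_1 + (1-t){\boldsymbol{X}}_0$ with ${\boldsymbol{X}}_0$ Gaussian noise, an overshoot time $o = \min(s + c(s-t), 1)$, and a return step $a\hat{{\boldsymbol{Z}}}_o + b\,\xi$ with $a = s/o$ and $b$ chosen so that the noise scale at time $s$ equals $1-s$. The names `overshoot_coeffs`, `overshoot_step`, `sample`, and `toy_velocity` are hypothetical.

```python
import numpy as np

def overshoot_coeffs(t, s, c):
    """Coefficients for one step from time t to s > t (assumed parameterization).

    o: overshoot time, a: rescaling applied to the overshot state,
    b: std of the re-injected noise, chosen so that the rescaled noise
    variance a^2 (1-o)^2 plus b^2 equals (1-s)^2.
    """
    o = min(s + c * (s - t), 1.0)
    a = s / o
    b = np.sqrt(max((1.0 - s) ** 2 - (a * (1.0 - o)) ** 2, 0.0))
    return o, a, b

def overshoot_step(z, t, s, c, velocity, rng):
    """Over-simulate the ODE from t to o with one Euler step, then return to s."""
    o, a, b = overshoot_coeffs(t, s, c)
    z_o = z + (o - t) * velocity(z, t)        # Euler step past the target time
    noise = rng.standard_normal(z.shape)      # fresh Gaussian noise
    return a * z_o + b * noise                # c = 0 gives a = 1, b = 0 (plain Euler)

def sample(velocity, shape, num_steps, c, seed=0):
    """Run the sampler on a uniform time grid from t = 0 (noise) to t = 1 (data)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        z = overshoot_step(z, t, s, c, velocity, rng)
    return z

# Toy usage: for a point-mass data distribution at x1, the exact RF
# velocity is (x1 - z) / (1 - t), so the sampler should land on x1.
x1 = np.array([2.0, -1.0])
toy_velocity = lambda z, t: (x1 - z) / (1.0 - t)
z_final = sample(toy_velocity, shape=(2,), num_steps=10, c=2.0)
```

With $c = 0$ the step reduces to the Euler update, which makes the extra model-free cost of the overshooting variant easy to see: one rescaling and one noise draw per step, with the same number of velocity evaluations.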

Paper Structure

This paper contains 39 sections, 2 theorems, 33 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Lemma A.1

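Assume random variables ${\boldsymbol{X}} = {\boldsymbol{Y}} + {\boldsymbol{Z}}$, where ${\boldsymbol{Y}}$ and ${\boldsymbol{Z}}$ are independent. Then

$$\nabla_{{\boldsymbol{x}}} \log \rho_{\boldsymbol{X}}({\boldsymbol{x}}) = \mathbb{E}\left[\nabla \log \rho_{\boldsymbol{Z}}({\boldsymbol{Z}}) \mid {\boldsymbol{X}} = {\boldsymbol{x}}\right] = \mathbb{E}\left[\nabla \log \rho_{\boldsymbol{Y}}({\boldsymbol{Y}}) \mid {\boldsymbol{X}} = {\boldsymbol{x}}\right],$$

where $\rho_{\boldsymbol{X}}$, $\rho_{\boldsymbol{Z}}$, and $\rho_{{\boldsymbol{Y}}}$ are the density functions of ${\boldsymbol{X}}$, ${\boldsymbol{Z}}$, and ${\boldsymbol{Y}}$, respectively.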
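
For intuition, the Gaussian case of this setting can be worked out in closed form. The sketch below is illustrative only: the isotropic variances $\sigma_Y^2$ and $\sigma_Z^2$ are assumptions made for the example and do not come from the paper. It checks the standard score identity for a sum of independent variables, namely that the score of ${\boldsymbol{X}}$ equals the conditional expectation of the score of either summand.

$$
\begin{aligned}
&{\boldsymbol{Y}} \sim \mathcal{N}(0, \sigma_Y^2 I), \quad {\boldsymbol{Z}} \sim \mathcal{N}(0, \sigma_Z^2 I) \;\;\Longrightarrow\;\; {\boldsymbol{X}} = {\boldsymbol{Y}} + {\boldsymbol{Z}} \sim \mathcal{N}\big(0, (\sigma_Y^2 + \sigma_Z^2) I\big), \\
&\nabla_{{\boldsymbol{x}}} \log \rho_{\boldsymbol{X}}({\boldsymbol{x}}) = -\frac{{\boldsymbol{x}}}{\sigma_Y^2 + \sigma_Z^2}, \\
&\mathbb{E}[{\boldsymbol{Z}} \mid {\boldsymbol{X}} = {\boldsymbol{x}}] = \frac{\sigma_Z^2}{\sigma_Y^2 + \sigma_Z^2}\,{\boldsymbol{x}}, \\
&\mathbb{E}\left[\nabla \log \rho_{\boldsymbol{Z}}({\boldsymbol{Z}}) \mid {\boldsymbol{X}} = {\boldsymbol{x}}\right] = -\frac{\mathbb{E}[{\boldsymbol{Z}} \mid {\boldsymbol{X}} = {\boldsymbol{x}}]}{\sigma_Z^2} = -\frac{{\boldsymbol{x}}}{\sigma_Y^2 + \sigma_Z^2} = \nabla_{{\boldsymbol{x}}} \log \rho_{\boldsymbol{X}}({\boldsymbol{x}}).
\end{aligned}
$$

Swapping the roles of ${\boldsymbol{Y}}$ and ${\boldsymbol{Z}}$ gives the same result for $\rho_{\boldsymbol{Y}}$ by symmetry.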

Figures (13)

  • Figure 1: Visualization of the Overshooting Sampler. Given $\tilde{{\boldsymbol{Z}}}_t$ at time $t$, we first over-simulate the learned ODE to $\hat{{\boldsymbol{Z}}}_o$, and then add noise and return to $\tilde{{\boldsymbol{Z}}}_s$. The noise is carefully selected such that $\tilde{{\boldsymbol{Z}}}_s$ matches ${\boldsymbol{X}}_s$'s marginal distribution.
  • Figure 2: Euler versus Overshooting on a toy dataset. The noise ($\pi_0$) and data ($\pi_1$) distributions are shown as blue and light-purple dots. Top: The samples from Euler deviate from $\pi_1$. The Overshooting sampler helps correct the marginal. As $c$ increases, the correction effect becomes stronger, but it also introduces smoothing artifacts. Bottom: Starting with $\tilde{{\boldsymbol{Z}}}_t$ ($t=0.5$) from the Euler sampler, if we apply (Overshooting - Euler) 5 times, i.e., the Langevin dynamics part in Equation \ref{eq:langevin-sde-2}, the samples align better with ${\boldsymbol{Z}}_{0.5}$.
  • Figure 3: Comparison of text rendering quality between Euler and our stochastic sampling method across three different text-to-image models: (a) Flux, (b) Stable Diffusion 3 (SD3), and (c) AuraFlow. All results are generated using the same random seed for consistent comparison. Within each pair of images, the left column corresponds to the Euler sampler, while the right column displays the results from our method. Our approach consistently generates clearer and more legible text that closely matches the provided prompts. Additional examples are provided in the Appendix.
  • Figure 4: The comparison of Euler sampler and AMO across different sampling steps (20, 50, and 100 steps). AMO consistently outperforms the deterministic sampler on text rendering performance across all step sizes.
  • Figure 5: Image Quality for Euler, Overshooting, and AMO. Please zoom in for details. Bottom: both Overshooting (AMO without attention modulation) and AMO render the text correctly, while Euler renders misspelled text. Top: Looking at the parrot's feathers or the smoke behind the saxophone, Euler generates high-fidelity, high-frequency details while the Overshooting sampler over-smooths the image (fewer details). With attention modulation, AMO preserves the details produced by Euler. In addition, we run 5-step Overshooting, meaning that at each time $t$ we use $c' = c/5$ but apply (Overshoot - Euler) 5 times (i.e., the Langevin step in Equation \ref{eq:langevin-sde-2}), followed by 1 Euler step at the end. We see that with smaller $c$ but more local Langevin steps the smoothing effect also goes away, but in practice, this requires more model evaluations.
  • ...and 8 more figures
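
The trade-off shown in Figure 5 (uniform overshooting smooths fine detail, while attention modulation keeps it) can be illustrated with a per-patch overshooting strength. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `modulated_overshoot_step`, the max-normalization of attention scores to $[0, 1]$, and the coefficient formulas ($o = \min(s + c\,m\,(s-t), 1)$, $a = s/o$, $b = \sqrt{(1-s)^2 - a^2(1-o)^2}$) are all hypothetical choices made for the example.

```python
import numpy as np

def modulated_overshoot_step(z, t, s, c, attn, velocity, rng):
    """One overshooting step with a per-patch strength.

    z:    (num_patches, dim) latent patches at time t
    attn: (num_patches,) non-negative attention scores with the text tokens
    Patches with zero attention get a plain Euler update; patches with the
    highest attention get the full overshooting strength c.
    """
    m = attn / (attn.max() + 1e-8)                 # hypothetical normalization to [0, 1]
    c_patch = c * m[:, None]                       # broadcast over the feature dimension
    o = np.minimum(s + c_patch * (s - t), 1.0)     # per-patch overshoot time
    a = s / o                                      # per-patch rescaling
    b = np.sqrt(np.clip((1.0 - s) ** 2 - (a * (1.0 - o)) ** 2, 0.0, None))
    z_o = z + (o - t) * velocity(z, t)             # one velocity evaluation, shared by all patches
    return a * z_o + b * rng.standard_normal(z.shape)

# Toy usage: two patches, only the second one attends to the text.
x1 = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_velocity = lambda z, t: (x1 - z) / (1.0 - t)   # exact RF velocity for point-mass data
z = np.zeros((2, 2))
attn = np.array([0.0, 1.0])
z_next = modulated_overshoot_step(z, 0.5, 0.6, 2.0, attn, toy_velocity,
                                  np.random.default_rng(0))
```

In this sketch the modulation changes only the scalar coefficients per patch, so the number of velocity evaluations per step is the same as for the Euler sampler.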

Theorems & Definitions (4)

  • Lemma A.1
  • proof
  • Lemma A.2
  • proof