TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai

TL;DR

TextPecker is a plug-and-play, structural-anomaly-perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator, providing a foundational step towards reliable and structurally faithful visual text generation.

Abstract

Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., Seedream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play, structural-anomaly-perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state of the art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

Paper Structure (34 sections, 6 equations, 13 figures, 14 tables)

Figures (13)

  • Figure 1: Existing OCR models and MLLMs struggle to perceive fine-grained structural anomalies in rendered text images, creating a key bottleneck for both VTR evaluation and RL-based optimization. Misrecognized characters are highlighted in RED.
  • Figure 2: Schematic illustration of the TextPecker framework. Given a generative prompt, we first sample $G$ candidate outputs $\{o_i\}_{i=1}^G$ from the reference policy model $\pi_{\theta_{\text{ref}}}$. Each $o_i$ is sent to a structure-aware recognizer that extracts the generated text at a fine-grained level, with markers indicating structurally anomalous text. We then compute the joint reward $\mathcal{R}_i$ as a weighted sum of the semantic alignment and structural quality scores (Sec. \ref{sec:reward_modeling}). Each $\mathcal{R}_i$ is normalized to a group-relative advantage $A_i$. Finally, we optimize the current policy model $\pi_\theta$ by maximizing $A_i$ while enforcing proximity to $\pi_{\theta_{\text{ref}}}$ via KL divergence.
  • Figure 3: Illustration of the proposed data construction pipeline.
  • Figure 4: Qualitative Comparisons of Text Rendering for Qwen-Image and RL-Optimized Variants. Readers are strongly encouraged to consult the appendix for extensive comparisons across generative models and RL baselines.
  • Figure 5: Qualitative Comparisons of Text Rendering for Flux.1[dev] and RL-Optimized Variants.
  • ...and 8 more figures
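
The reward step described in the Figure 2 caption (a weighted sum of semantic and structural scores, followed by normalization within the sampled group) can be sketched roughly as below. This is only an illustrative sketch, not the authors' implementation: the function names, the default weights `w_sem` and `w_struct`, the example scores, and the mean/standard-deviation normalization (the form commonly used in group-relative policy optimization) are all assumptions, since the excerpt gives neither the weights nor the exact normalization formula.

```python
from statistics import mean, pstdev


def joint_reward(semantic, structural, w_sem=0.5, w_struct=0.5):
    """Weighted sum of a semantic-alignment score and a structural-quality score.

    The weights are hypothetical placeholders; the excerpt does not state them.
    """
    return w_sem * semantic + w_struct * structural


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and (roughly) unit variance.

    Assumed form: A_i = (R_i - mean(R)) / (std(R) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (semantic, structural) scores for G = 4 sampled candidates.
scores = [(0.9, 0.8), (0.5, 0.4), (0.7, 0.9), (0.2, 0.3)]
rewards = [joint_reward(sem, struct) for sem, struct in scores]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, a candidate is rewarded only for rendering text better than its siblings sampled from the same prompt; a group whose candidates all score identically contributes zero advantage under this assumed normalization.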