CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder

Lichen Ma, Tiezhu Yue, Pei Fu, Yujie Zhong, Kai Zhou, Xiaoming Wei, Jie Hu

TL;DR

This work tackles the persistent challenge of inaccurate character rendering in diffusion-based visual text generation. It introduces CharGen, a character-level multimodal encoder coupled with an ODM-based multi-scale CharGen perceptual loss, integrated into a ControlNet diffusion framework with Long-CLIP to handle long text captions. The method achieves state-of-the-art performance on English and Chinese benchmarks (AnyText-benchmark and MARIO-Eval), notably improving Sen.ACC by about 8.8% on English and 5.5% on Chinese, while delivering sharper glyphs for complex, multi-stroke characters. These advances enhance both the fidelity of rendered visual text and the reliability of visual-text editing in multilingual scenes, with practical implications for OCR-aware image synthesis and downstream text-centric applications. The training objective combines diffusion control with glyph-focused supervision via $L = L_{cd} + \lambda L_{chargen}$, where $\lambda$ is tuned (e.g., $0.01$), and the system leverages Long-CLIP to expand text capacity to $248$ tokens.
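As a rough illustration of the objective $L = L_{cd} + \lambda L_{chargen}$ quoted above, the sketch below combines a control-diffusion loss with a weighted perceptual term. Only the weighted sum and the example value $\lambda = 0.01$ come from the summary. The function names, the use of plain Python lists as stand-ins for feature maps, and the choice of a per-scale mean squared error for the multi-scale perceptual term are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized, flattened feature vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_scale_perceptual_loss(
    gen_feats: Sequence[Sequence[float]],
    ref_feats: Sequence[Sequence[float]],
) -> float:
    """Hypothetical stand-in for L_chargen.

    Each entry is one scale's flattened features, assumed to come from a
    frozen feature extractor (the summary mentions an ODM-based one).
    Summing per-scale MSE is an assumption, not a detail from the paper.
    """
    if len(gen_feats) != len(ref_feats):
        raise ValueError("both inputs must have the same number of scales")
    return sum(mse(g, r) for g, r in zip(gen_feats, ref_feats))


def total_loss(l_cd: float, l_chargen: float, lam: float = 0.01) -> float:
    """L = L_cd + lambda * L_chargen, with lambda = 0.01 as the quoted example."""
    return l_cd + lam * l_chargen


# Toy usage with made-up numbers: two scales of features.
generated = [[1.0, 2.0], [0.0, 0.0, 3.0]]
reference = [[1.0, 4.0], [0.0, 0.0, 0.0]]
l_chargen = multi_scale_perceptual_loss(generated, reference)
loss = total_loss(l_cd=0.5, l_chargen=l_chargen)
```

With a small $\lambda$, the perceptual term acts as a gentle correction on glyph shape rather than dominating the diffusion objective, which is one plausible reading of why a value such as $0.01$ is given as the example.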

Abstract

Recently, significant advancements have been made in diffusion-based visual text generation models. Although the effectiveness of these methods in visual text rendering is rapidly improving, they still encounter challenges such as inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. This enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to enhance character shape supervision and address the issue of inaccurate strokes in generated text. It is worth mentioning that CharGen can be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieved a 5.5% increase in accuracy on Chinese test sets.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (A.1) A conventional text encoder without visual glyph information. (A.2) A character-level text embedding that indirectly incorporates glyph information. (A.3) A block-level visual embedding that replaces text embedding. (A.4) A character-level multi-modal encoder. (B.1) Based on an OCR detection model. (B.2) Based on an OCR recognition model. (B.3) Based on an ODM pre-trained model.
  • Figure 2: The framework of CharGen.
  • Figure 3: Visual text image generated by CharGen.
  • Figure 4: A qualitative comparison of CharGen with AnyText on English and Chinese text generation, using test captions from the AnyText-benchmark dataset.
  • Figure 5: A qualitative comparison of CharGen with AnyText on visual text editing.