CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder
Lichen Ma, Tiezhu Yue, Pei Fu, Yujie Zhong, Kai Zhou, Xiaoming Wei, Jie Hu
TL;DR
This work tackles the persistent problem of inaccurate character rendering in diffusion-based visual text generation. It introduces CharGen, which couples a character-level multimodal encoder with an ODM-based multi-scale perceptual loss and integrates both into a ControlNet diffusion framework, using Long-CLIP to handle long text captions. The method achieves state-of-the-art performance on English and Chinese benchmarks (AnyText-benchmark and MARIO-Eval), improving Sen.ACC by about 8.8% on English and 5.5% on Chinese, and renders sharper glyphs for complex, multi-stroke characters. These gains improve both the fidelity of rendered visual text and the reliability of visual-text editing in multilingual scenes, with practical implications for OCR-aware image synthesis and downstream text-centric applications. The training objective adds glyph-focused supervision to the diffusion objective, $L = L_{cd} + \lambda L_{chargen}$, where $\lambda$ is a weighting hyperparameter (e.g., $0.01$), and Long-CLIP extends the text capacity to $248$ tokens.
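To make the objective $L = L_{cd} + \lambda L_{chargen}$ concrete, the following is a minimal sketch in plain Python. It is illustrative only. The function names (`mse`, `multiscale_perceptual_loss`, `total_loss`) are hypothetical, and averaging a per-scale mean squared error over feature maps is an assumption about what a multi-scale perceptual term could look like, not the paper's exact formulation. Only the weighted sum and the example value $\lambda = 0.01$ come from the text above.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized, flattened feature vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multiscale_perceptual_loss(
    pred_feats: Sequence[Sequence[float]],
    target_feats: Sequence[Sequence[float]],
) -> float:
    """Average the per-scale MSE over all feature scales.

    Assumption: each entry is a flattened feature map from one scale of a
    frozen feature extractor; the paper's actual loss may weight or compare
    scales differently.
    """
    if len(pred_feats) != len(target_feats) or len(pred_feats) == 0:
        raise ValueError("need the same, non-zero number of scales")
    per_scale = [mse(p, t) for p, t in zip(pred_feats, target_feats)]
    return sum(per_scale) / len(per_scale)


def total_loss(l_cd: float, l_chargen: float, lam: float = 0.01) -> float:
    """Combine the two terms as L = L_cd + lambda * L_chargen."""
    return l_cd + lam * l_chargen


# Toy usage with two feature scales (values are made up for illustration).
pred = [[1.0, 2.0], [0.0]]
target = [[1.0, 0.0], [1.0]]
l_chargen = multiscale_perceptual_loss(pred, target)
loss = total_loss(l_cd=0.5, l_chargen=l_chargen, lam=0.01)
```

With a small $\lambda$ such as $0.01$, the perceptual term acts as a light regularizer on glyph shape while the diffusion term still dominates the gradient.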
Abstract
Diffusion-based visual text generation models have advanced significantly in recent years. Although their visual text rendering is improving rapidly, they still produce inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character, which enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to strengthen character shape supervision and address inaccurate strokes in generated text. CharGen can also be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods on public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieves a 5.5% increase in accuracy on Chinese test sets.
