ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
TL;DR
ControlText introduces a data-driven, font-agnostic diffusion framework for multilingual visual text rendering that uses segmentation-based glyph masks as pixel-space font controls, eliminating the need for ground-truth font labels. The pipeline has two stages: a training stage that collects font-aware glyphs and applies perspective perturbations, and an inference stage in which users supply the text, font files, and target regions for editing, after which the edited result is blended back into the original image. It proposes novel metrics for evaluating fuzzy fonts using a pretrained font classifier and demonstrates zero-shot generalization to unseen languages while preserving font details across diverse scripts and fonts. The work offers a scalable, open-world approach with a public codebase and evaluation framework for practical multilingual visual text rendering and font customization.
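As a rough illustration of the final blending step at inference, the sketch below pastes a generated text region back into the original image under a soft mask. It is a minimal pure-Python example on grayscale pixel grids, not the authors' implementation: the function name `blend_region`, the list-of-lists image layout, and the linear alpha blend are all assumptions made for illustration.

```python
from typing import List

# Assumed layout: grayscale image as rows of intensities in [0, 1].
Image = List[List[float]]


def blend_region(original: Image, generated: Image, mask: Image,
                 top: int, left: int) -> Image:
    """Paste `generated` into a copy of `original` at (top, left).

    `mask` has the same shape as `generated`; a value of 1 keeps the
    generated pixel, 0 keeps the original pixel, and values in between
    mix the two linearly. (Hypothetical helper, for illustration only.)
    """
    same_shape = len(generated) == len(mask) and all(
        len(g) == len(m) for g, m in zip(generated, mask)
    )
    if not same_shape:
        raise ValueError("generated and mask must have the same shape")

    out = [row[:] for row in original]  # copy so the input stays untouched
    for r, (gen_row, mask_row) in enumerate(zip(generated, mask)):
        for c, (g, m) in enumerate(zip(gen_row, mask_row)):
            y, x = top + r, left + c
            # Skip pixels of the region that fall outside the image.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = m * g + (1.0 - m) * out[y][x]
    return out


original = [[0.0] * 3 for _ in range(3)]
generated = [[1.0, 1.0], [1.0, 1.0]]
mask = [[1.0, 0.0], [0.5, 1.0]]
blended = blend_region(original, generated, mask, top=1, left=1)
```

Pixels outside the user-selected region are left exactly as in the original, so only the edited area changes.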
Abstract
This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using only raw images, without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. Our experiments provide a proof of concept of the algorithm in zero-shot text and font editing across diverse fonts and languages, offering valuable insights for the community and industry toward generalized visual text rendering. Code is available at github.com/bowen-upenn/ControlText.
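To make the idea of a pixel-space font representation concrete, the sketch below turns a per-pixel text-probability map into a binary glyph mask restricted to one text region. The abstract does not describe this step in detail, so everything here is an assumption for illustration: the function name `glyph_mask`, the 0.5 threshold, and the `(top, left, bottom, right)` box format are hypothetical, and the probability map stands in for the output of a text segmentation model.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (top, left, bottom, right)


def glyph_mask(prob_map: List[List[float]], box: Box,
               threshold: float = 0.5) -> List[List[int]]:
    """Binarize text probabilities inside `box`; pixels outside become 0.

    The resulting mask keeps the stroke shapes of the text in the region,
    so it carries font information without needing a font label.
    (Hypothetical helper; the threshold is an illustrative choice.)
    """
    top, left, bottom, right = box
    return [
        [
            1 if (top <= y < bottom and left <= x < right and p >= threshold)
            else 0
            for x, p in enumerate(row)
        ]
        for y, row in enumerate(prob_map)
    ]


probs = [
    [0.9, 0.2, 0.8],
    [0.6, 0.7, 0.1],
]
mask = glyph_mask(probs, box=(0, 0, 2, 2))
```

In this toy input the high-probability pixel in the last column is dropped because it lies outside the selected box.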
