ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations

Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor

TL;DR

ControlText introduces a data-driven, font-agnostic diffusion framework for multilingual visual text rendering that uses segmentation-based glyph masks as pixel-space font controls, eliminating the need for ground-truth font labels. The system has two parts: a training pipeline that collects font-aware glyphs and applies perspective perturbations, and an inference pipeline in which users supply text, font files, and regions, after which the generated result is blended into the original image. It introduces new metrics for evaluating fuzzy fonts with a pretrained font classifier and demonstrates zero-shot generalization to unseen languages while preserving font details across diverse scripts and fonts. The work offers a scalable, open-world approach with a public codebase and evaluation framework, advancing practical multilingual visual text rendering and font customization.

Abstract

This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering. Code is available at github.com/bowen-upenn/ControlText.
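
As a rough illustration of the training-side data preparation summarized above (segmentation masks kept only when an OCR check passes, then randomly perturbed), the following is a minimal sketch and not the authors' implementation. It assumes each sample already carries a reference transcription (`text`), the OCR reading of its segmentation mask (`ocr_text`), and a four-corner text region (`quad`). The function names, the similarity threshold `min_ratio`, and the corner-jitter form of the perturbation are all hypothetical choices made for this example.

```python
import random
from difflib import SequenceMatcher


def mask_is_usable(ocr_text, reference_text, min_ratio=0.8):
    """Keep a mask only if OCR on it roughly recovers the expected text.

    The similarity measure and threshold are assumptions for this sketch.
    """
    ratio = SequenceMatcher(None, ocr_text.strip(), reference_text.strip()).ratio()
    return ratio >= min_ratio


def perturb_quad(quad, max_shift, rng):
    """Jitter each corner of a text-region quadrilateral by up to max_shift pixels.

    Moving the four corners independently defines a small perspective change,
    so the glyph control does not sit at exactly the same pixels as the target.
    """
    return [
        (x + rng.uniform(-max_shift, max_shift), y + rng.uniform(-max_shift, max_shift))
        for x, y in quad
    ]


def build_glyph_controls(samples, min_ratio=0.8, max_shift=3.0, seed=0):
    """Filter samples with the OCR check, then perturb the surviving regions."""
    rng = random.Random(seed)
    controls = []
    for sample in samples:
        if not mask_is_usable(sample["ocr_text"], sample["text"], min_ratio):
            continue  # low-quality mask: the OCR reading disagrees with the text
        controls.append(
            {"text": sample["text"], "quad": perturb_quad(sample["quad"], max_shift, rng)}
        )
    return controls
```

In a real pipeline the perturbed corners would drive an image warp of the mask itself; here only the geometry is shown so that the sketch stays free of image-library dependencies.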

Paper Structure

This paper contains 30 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Examples of real-world test images with text generated by ControlText in various fonts and languages. Each row presents both the rendered images and the textual part of the corresponding glyph controls that provide the text and the intricate font information in pixel space.
  • Figure 2: System overview. It consists of two parts. (1) Training pipeline: text segmentation masks are extracted as glyph controls from a large image dataset without ground-truth font annotations. Low-quality masks are filtered out using an OCR model, and random perturbations are applied to prevent the model from overfitting to exact pixel locations of the glyphs. (2) Inference pipeline: users upload images, specify text regions, and provide any desired font file through the user front-end. The model generates an image patch with the rendered text, which is then seamlessly blended into the original image. Throughout this figure, models marked with a fire icon indicate trainable weights, while those marked with a snowflake icon are frozen.
  • Figure 3: Evaluation pipeline: the cropped regions of the generated text and the input glyph are processed by a pretrained font classification model, which may not have seen the user-specified font. The proposed $l_2@k$ and $\cos@k$ metrics for fuzzy fonts assume that similar fonts have similar output probability vectors; only the top-$k$ values are retained, with the rest zeroed out.
  • Figure 4: Continuation of Figure 1. Examples of real-world and AI-generated images with text generated by ControlText in various fonts and languages. Each row presents both the rendered images and the textual part of their glyph controls. We also try the most complex Chinese character, "biang", in the bottom row, accompanied by a zoomed-in view of the rendered character. ControlText effectively renders text with realistic integration into backgrounds while maintaining correct letters and characters in their user-specified fonts.
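
The $l_2@k$ and $\cos@k$ metrics described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: it assumes the font classifier returns one probability vector per crop, that the top-$k$ entries are selected independently for each vector, and that ties are broken by lower index. The function names are hypothetical.

```python
import math


def keep_top_k(probs, k):
    """Return a copy of probs with all but the k largest entries set to zero."""
    if k <= 0:
        return [0.0] * len(probs)
    # Indices of the k largest values; ties are broken by lower index (assumption).
    top = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    keep = set(top)
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


def l2_at_k(p_generated, p_glyph, k):
    """Euclidean distance between the two top-k-truncated vectors (lower is closer)."""
    a, b = keep_top_k(p_generated, k), keep_top_k(p_glyph, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cos_at_k(p_generated, p_glyph, k):
    """Cosine similarity between the two top-k-truncated vectors (higher is closer)."""
    a, b = keep_top_k(p_generated, k), keep_top_k(p_glyph, k)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0
```

For example, `l2_at_k([0.5, 0.3, 0.2], [0.5, 0.2, 0.3], k=2)` compares `[0.5, 0.3, 0.0]` with `[0.5, 0.0, 0.3]`, so two crops that agree on the most likely font but disagree on the runner-up are penalized without requiring an exact label match.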