FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Xia Xin, Yuki Endo, Yoshihiro Kanamori

TL;DR

Models trained with the FontUse annotation pipeline produce text renderings that are more consistent with prompts than those of competitive baselines. The paper also introduces a Long-CLIP-based metric that measures alignment between generated typography and requested attributes.

Abstract

Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
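
As an illustration of how a style and a use case might be combined into one prompt, here is a minimal sketch. The record layout, the field names, and the prompt template are all assumptions made for this example; the abstract only states that each image carries a prompt, text-region locations, and OCR-recognized strings, and does not specify the actual schema or wording.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TypographyAnnotation:
    """Hypothetical record for one annotated text region (field names are assumptions)."""
    image_path: str
    bbox: Tuple[int, int, int, int]  # text-region location as (x0, y0, x1, y1)
    text: str                        # OCR-recognized string
    font_styles: List[str]           # e.g. ["serif", "elegant"]
    use_case: str                    # e.g. "wedding invitation"


def build_prompt(ann: TypographyAnnotation) -> str:
    """Combine font styles and use case into one prompt (template is an assumption)."""
    styles = ", ".join(ann.font_styles)
    return f'Text "{ann.text}" in a {styles} font, suitable for a {ann.use_case}.'


example = TypographyAnnotation(
    image_path="invite.png",
    bbox=(40, 120, 360, 200),
    text="Save the Date",
    font_styles=["script", "elegant"],
    use_case="wedding invitation",
)
print(build_prompt(example))
```

Keeping the style terms and the use case as separate fields would let the same region be paired with differently worded prompts, which is one plausible way to support conditioning through text alone.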

Paper Structure (16 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Fine-tuned models with our supervision (FontUse) generate legible target strings consistent with specified font styles and use cases. Top: Under identical prompts, fine-tuned models show improved results over baselines in both inpainting and full-image generation. Bottom: Additional examples illustrate diverse glyph forms and stylistic effects (backgrounds cropped; target strings omitted).
  • Figure 2: Dataset construction pipeline. (a) Typography-focused images are collected from public design resources. (b) Hi-SAM (Ye et al., 2024) detects text regions and outputs bounding boxes. (c)-(e) An MLLM performs text recognition, font-style annotation, and use-case annotation.
  • Figure 3: Example of the input prompt and the MLLM's response. See the appendix for details of 〈example format〉.
  • Figure 4: Prompt used for text recognition. The MLLM is instructed to output only the extracted word (case-sensitive), returning "-" if no text is detected and "#" for non-Roman characters.
  • Figure 5: Top-40 style and use-case families after consolidation, illustrating their frequency distributions.
  • ...and 8 more figures
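
The output convention described in the Figure 4 caption ("-" for no detected text, "#" for non-Roman characters) can be handled with a small post-processing step. The sketch below is only an illustration: the function name is hypothetical, and discarding sentinel regions is an assumption, since the caption does not say how such regions are treated downstream.

```python
from typing import Optional

NO_TEXT = "-"    # sentinel: no text detected (per the Figure 4 caption)
NON_ROMAN = "#"  # sentinel: non-Roman characters (per the Figure 4 caption)


def parse_recognition(response: str) -> Optional[str]:
    """Return the recognized word, or None when the region has no usable text.

    Treating both sentinels as "skip this region" is an assumption.
    """
    word = response.strip()
    if not word or word in (NO_TEXT, NON_ROMAN):
        return None
    return word  # case is preserved, since recognition is case-sensitive


for raw in ["Espresso", " - ", "#", "MENU"]:
    print(repr(raw), "->", parse_recognition(raw))
```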