FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Xia Xin; Yuki Endo; Yoshihiro Kanamori

TL;DR

Models fine-tuned on annotations from the proposed pipeline render text that follows the prompt more consistently than competitive baselines. The paper also introduces a Long-CLIP-based metric that measures alignment between generated typography and requested attributes.

Abstract

Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
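
To make the annotation format described above more concrete, the following is a minimal sketch of how one annotated text region might be represented and turned into a prompt that combines font style and use case. The class name, field names, and prompt template are all hypothetical and are not taken from the released pipeline; only the three kinds of annotation (prompt, text-region location, OCR-recognized string) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TypographyAnnotation:
    """One annotated text region. All field names are illustrative assumptions."""
    image_path: str
    bbox: Tuple[int, int, int, int]  # text-region location as (x0, y0, x1, y1)
    text: str                        # OCR-recognized string
    font_styles: List[str]           # e.g. ["serif", "elegant"]
    use_cases: List[str]             # e.g. ["wedding invitations"]


def build_prompt(ann: TypographyAnnotation) -> str:
    """Combine style and use-case tags into one prompt (hypothetical template)."""
    styles = ", ".join(ann.font_styles)
    uses = ", ".join(ann.use_cases)
    return f'Text "{ann.text}" in a {styles} font, suitable for {uses}'


example = TypographyAnnotation(
    image_path="invite.png",
    bbox=(40, 60, 420, 140),
    text="Save the Date",
    font_styles=["serif", "elegant"],
    use_cases=["wedding invitations"],
)
print(build_prompt(example))
```

Keeping style tags and use-case tags in separate fields, as assumed here, would let the same record yield prompts that mention either attribute alone or both together.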

Paper Structure (16 sections, 13 figures, 5 tables)

Introduction
Related Work
Method
Dataset Construction
Prompt Design and Rationale
Dataset Composition and Statistics
Experiments
Implementation Details
Evaluation Metrics
Comparison with Baselines
Reliability Analysis of Evaluation Metrics
Evaluation of the OCR Pipeline
Conclusions
Details of Input Prompts
Details of Prompts for MLLM Evaluation
...and 1 more section

Figures (13)

Figure 1: Fine-tuned models with our supervision (FontUse) generate legible target strings consistent with specified font styles and use cases. Top: Under identical prompts, fine-tuned models show improved results over baselines in both inpainting and full-image generation. Bottom: Additional examples illustrate diverse glyph forms and stylistic effects (backgrounds cropped; target strings omitted).
Figure 2: Dataset construction pipeline. (a) Typography-focused images are collected from public design resources. (b) Hi-SAM (Ye et al., 2024) detects text regions and outputs bounding boxes. (c)-(e) An MLLM performs text recognition, font-style annotation, and use-case annotation.
Figure 3: Example of the input prompt and an MLLM response. See the appendix for details of 〈example format〉.
Figure 4: Prompt used for text recognition. The MLLM is instructed to output only the extracted word (case-sensitive), returning "-" if no text is detected and "#" for non-Roman characters.
Figure 5: Top-40 style and use-case families after consolidation, illustrating their frequency distributions.
...and 8 more figures
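
The output convention in the Figure 4 caption (a single case-sensitive word, "-" when no text is detected, "#" for non-Roman characters) can be handled with a small parser. The sketch below is an illustration only: the function name is hypothetical, and treating both sentinels (and an empty reply) as "skip this region" is an assumption, not something the source states.

```python
from typing import Optional

NO_TEXT = "-"    # sentinel from the Figure 4 caption: no text detected
NON_ROMAN = "#"  # sentinel from the Figure 4 caption: non-Roman characters


def parse_recognition(response: str) -> Optional[str]:
    """Return the recognized word, or None if the region has no usable text.

    Mapping both sentinels to None is an assumed policy for this sketch.
    """
    word = response.strip()
    if word in ("", NO_TEXT, NON_ROMAN):
        return None
    return word  # case is left untouched because recognition is case-sensitive


for reply in ["Espresso", " - ", "#"]:
    print(repr(reply), "->", parse_recognition(reply))
```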
