FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications

Yuki Tatsukawa, I-Chao Shen, Anran Qi, Yuki Koyama, Takeo Igarashi, Ariel Shamir

TL;DR

This work integrates typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model through a novel finetuning approach. The finetuning uses a compound descriptive prompt that encapsulates adaptively sampled attributes from a font attribute dataset focusing on Roman alphabet characters.

Abstract

Acquiring the desired font for various design tasks can be challenging and requires professional typographic knowledge. While previous font retrieval or generation works have alleviated some of these difficulties, they often lack support for multiple languages and for semantic attributes beyond the training data domains. To solve this problem, we present FontCLIP: a model that connects the semantic understanding of a large vision-language model with typographical knowledge. We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model through a novel finetuning approach. We propose to use a compound descriptive prompt that encapsulates adaptively sampled attributes from a font attribute dataset focusing on Roman alphabet characters. FontCLIP's semantic typographic latent space demonstrates two unprecedented generalization abilities. First, FontCLIP generalizes to different languages including Chinese, Japanese, and Korean (CJK), capturing the typographical features of fonts across different languages, even though it was only finetuned using fonts of Roman characters. Second, FontCLIP can recognize semantic attributes that are not present in the training data. FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization, reducing the burden of obtaining desired fonts.
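
To make the idea of a compound descriptive prompt more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each font carries per-attribute scores on a 0 to 100 scale, and it samples attributes with probability proportional to how far each score lies from a neutral midpoint. The function name `compound_prompt`, the score scale, the weighting rule, and the prompt template are all hypothetical choices made for illustration.

```python
import random


def compound_prompt(scores, k=3, neutral=50.0, rng=None):
    """Build one prompt from k attributes sampled without replacement.

    ASSUMPTION: attributes whose score is far from `neutral` are more
    characteristic of the font, so they get a larger sampling weight.
    This weighting is illustrative, not the rule used in the paper.
    """
    rng = rng or random.Random()
    # Pair each attribute with its sampling weight (small epsilon avoids zero weights).
    pool = [(name, abs(score - neutral) + 1e-6) for name, score in scores.items()]
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(weight for _, weight in pool)
        threshold = rng.uniform(0.0, total)
        running = 0.0
        for index, (name, weight) in enumerate(pool):
            running += weight
            if threshold <= running:
                chosen.append(pool.pop(index)[0])
                break
        else:
            # Floating-point fallback: take the last remaining attribute.
            chosen.append(pool.pop()[0])
    # Attributes scored below the midpoint are phrased as negations.
    parts = [name if scores[name] >= neutral else f"not {name}" for name in chosen]
    return "This is a " + ", ".join(parts) + " font."


if __name__ == "__main__":
    example_scores = {"serif": 90.0, "bold": 10.0, "italic": 50.0}
    print(compound_prompt(example_scores, k=2, rng=random.Random(0)))
```

Sampling a fresh subset of attributes for each training step would expose the text encoder to many attribute combinations per font, which is one plausible reading of "adaptively sampled"; the paper body should be consulted for the actual sampling scheme.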

Paper Structure (5 sections, 4 equations, 6 figures)

Figures (6)

  • Figure 9: An overview of our multi-modal vector font optimization. Given an input letter $l$ ("A" in this example) represented as a set of outline control points $P$, and either (a) a language-driven descriptive prompt $T_{\text{user}}$ or (b) a visual-driven reference font image $I_{\text{user}}$, we iteratively optimize the new control point positions $\hat{P}$, creating the optimized letter shape $\hat{l}$. Inspired by Word-As-Image [iluz2023wordasimage], we first rasterize the deformed letter $\hat{l}$ with a differentiable rasterizer (DiffVG). To guide the optimization, we use a language loss $L_{\text{language}}$ in (a) language-driven optimization, or a visual loss $L_{\text{visual}}$ in (b) visual-driven optimization, to ensure that $\hat{l}$ aligns with the desired attributes indicated by the descriptive prompt or the reference font image. Moreover, our objective function includes a tone preservation loss $L_{\text{tone}}$ and an ACAP deformation loss $L_{\text{acap}}$, similar to Word-As-Image [iluz2023wordasimage]. Black and red dashed arrows indicate forward and backward computation, respectively.
  • Figure 10: Visualization of the vector font optimization steps of the language-driven Roman and Chinese character optimization using FontCLIP. We compare the results obtained by Word-As-Image [iluz2023wordasimage] and our method. Our method better captures and reconstructs each character's typographical features, such as serifs.
  • Figure 11: Ablation study on the language-driven font optimization. Given (a) an input font, we compare the results obtained by (b) replacing $L_{\text{language}}$ with an SDS loss, (c) our method using only $T_{\text{user}}$, and (d) our method using $T_{\text{final}}$. (The user-specified attributes are shown in blue and the attributes to be preserved are shown in red.)
  • Figure 12: Visualization of the optimization steps of the cross-lingual image-driven Roman and Chinese character optimization.
  • Figure 13: (a) Given a reference font image captured in the real world, our optimization method uses (b) the extracted letters to manipulate (c) the input letters. (d) The optimized letters exhibit a style similar to the fonts in the captured image.
  • ...and 1 more figure
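
The Figure 9 caption names the loss terms but not how they are combined. One plausible reading, given here only as a sketch, is a weighted sum of a guidance term and the two regularizers. The weights $\lambda_{\text{tone}}$ and $\lambda_{\text{acap}}$ and the symbols $L_{\text{total}}$ and $L_{\text{guide}}$ are assumptions introduced for illustration and do not appear in the captions above.

```latex
% Sketch of the combined objective over the deformed control points \hat{P}.
% ASSUMPTION: terms are combined as a weighted sum; the weights are hypothetical.
\begin{align}
  L_{\text{total}}(\hat{P})
    &= L_{\text{guide}}
     + \lambda_{\text{tone}} \, L_{\text{tone}}
     + \lambda_{\text{acap}} \, L_{\text{acap}}, \\
  L_{\text{guide}}
    &=
    \begin{cases}
      L_{\text{language}} & \text{(a) language-driven, given } T_{\text{user}}, \\
      L_{\text{visual}}   & \text{(b) visual-driven, given } I_{\text{user}}.
    \end{cases}
\end{align}
```

Under this reading, gradients of $L_{\text{total}}$ flow back through the differentiable rasterizer to the control points $\hat{P}$, which matches the forward and backward arrows described in the caption.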