One-Shot Multilingual Font Generation Via ViT

Zhiheng Wang, Jiarui Liu

TL;DR

This work tackles the challenge of one-shot multilingual font generation for both logographic and alphabetic scripts, including unseen and user-created characters. It introduces a Vision Transformer (ViT)–based framework pretrained with Masked Autoencoding (MAE) and employs a cross-attention bi-encoder to fuse content and style representations, producing glyphs across languages without strict reference constraints. A Retrieval-Augmented Guidance (RAG) module using FAISS enables dynamic style-reference retrieval to handle difficult inputs, complementing the main model. Across extensive experiments and human evaluations, the approach demonstrates strong generalization to unseen content and styles, cross-language transfer capabilities, and robustness, with practical implications for scalable, real-world font design.

Abstract

Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean (CJK), where thousands of unique characters must be individually crafted. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation that addresses the complexities of both logographic and alphabetic scripts. By leveraging ViT and pretraining with a strong visual pretext task (Masked Autoencoding, MAE), our model eliminates the need for the complex design components used in prior frameworks while achieving comprehensive results with enhanced generalizability. Remarkably, it can generate high-quality fonts across multiple languages for unseen, unknown, and even user-crafted characters. Additionally, we integrate a Retrieval-Augmented Guidance (RAG) module to dynamically retrieve and adapt style references, improving scalability and real-world applicability. We evaluate our approach on various font generation tasks, demonstrating its effectiveness, adaptability, and scalability.
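The style-reference retrieval mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary states that the RAG module uses FAISS, whereas the code below swaps in a brute-force cosine-similarity search in plain Python so that it runs without extra dependencies. The names `retrieve_style_references` and `style_index`, and the toy 3-dimensional embeddings, are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_style_references(query, index, k=1):
    """Return the names of the k stored embeddings most similar to `query`.

    `index` is a hypothetical dict mapping a style name to its embedding.
    A real system would replace this linear scan with a vector index.
    """
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for encoder outputs (assumed, not from the paper).
style_index = {
    "style_a": [1.0, 0.0, 0.0],
    "style_b": [0.0, 1.0, 0.0],
    "style_c": [0.7, 0.7, 0.0],
}
query_embedding = [0.9, 0.1, 0.0]
top_matches = retrieve_style_references(query_embedding, style_index, k=2)
print(top_matches)
```

In this toy setup the query lies closest to `style_a`, then to `style_c`, so those two names are returned in that order. How the retrieved references are then adapted and fed to the main model is not specified in this summary and is left out of the sketch.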

Paper Structure

This paper contains 18 sections, 2 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: From top to bottom: Chinese, Japanese, Korean, and English. The five styles shown are randomly picked from the total of 308 styles.
  • Figure 2: The comparison shows that it is necessary to pretrain the VitMAE model on our dataset. This gives the main model a reliable encoder and decoder to start from.
  • Figure 3: Our proposed model utilizes a cross-attention mechanism to guide the fusion of content and style embeddings, enhancing the flexibility and fidelity of glyph generation. Note that the pink-boxed RAG module is an add-on to our main model. The green-boxed font images are the content input, and the orange-boxed font image is the style input.
  • Figure 4: DiffuserFont captures content but deviates significantly in style.
  • Figure 5: In each row, the first image is the content image, the second image is the style image, the third image is the ground truth image, and the last image is the generated image. The examples are randomly picked from the four test sets.
  • ...and 7 more figures
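
The cross-attention fusion described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' architecture: it uses a single attention head with no learned projections, and it assumes that content tokens act as queries while style tokens act as keys and values. The summary does not say which side plays which role, so that choice, along with the function name `cross_attention`, is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(content_tokens, style_tokens):
    """Fuse style into content with scaled dot-product cross-attention.

    Each content token (query) attends over all style tokens (keys and
    values) and is replaced by the attention-weighted average of them.
    Assumption: no learned Q/K/V projections, single head.
    """
    dim = len(content_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in content_tokens:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in style_tokens
        ]
        weights = softmax(scores)
        fused.append(
            [
                sum(w * value[i] for w, value in zip(weights, style_tokens))
                for i in range(dim)
            ]
        )
    return fused


# Toy 2-dimensional token embeddings (assumed values for illustration).
content = [[1.0, 0.0], [0.0, 1.0]]
style = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(content, style)
print(fused)
```

The output keeps one fused vector per content token, so the sequence length follows the content input while the values are drawn from the style input. A full model would add learned projections, multiple heads, residual connections, and a decoder that renders the fused tokens into a glyph image.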