GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

Lei Kang, Fei Yang, Kai Wang, Mohamed Ali Souibgui, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas

TL;DR

GRIF-DM introduces a diffusion-based framework for generating rich impression fonts from a single letter and a set of impression keywords, addressing GAN-era auxiliary-loss limitations and explicit vector fusion shortcomings. It deploys a U-Net with dual cross-attention to separately encode letter and impression information, fused via CrossAttn-IMP and CrossAttn-LET, with text representations from frozen BERT embeddings $c_{imp}$ and $c_{let}$. The model optimizes a conditional diffusion objective $L = \mathbb{E}_{x_0,\epsilon,t} \| \epsilon - \epsilon_{\theta}(x_t, t, [c_{let}, c_{imp}]) \|^2$ under forward $q(x_t|x_{t-1})$ and learned reverse transitions $p_{\theta}(x_{t-1}|x_t,[c_{let},c_{imp}])$, enabling robust generation across variable keyword lengths. Empirically, GRIF-DM achieves strong FID/Intra-FID results on MyFonts (e.g., FID $=6.693$ with 5k samples and Intra-FID $=43.119$), demonstrates diverse and faithful font outputs, and shows resilience to out-of-vocabulary keywords via semantic embeddings; limitations include computational cost and coverage limited to the English alphabet, with future work aiming to incorporate LLMs for natural-language conditioning and to broaden language support.
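The objective above can be illustrated with a minimal sketch of the standard noise-prediction setup. It uses the well-known closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ of the forward process. The linear schedule, the 1000 steps, the 16-element vector standing in for a font image, and the `toy_eps_model` placeholder are all assumptions made for illustration; none of them is taken from the paper.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, eps):
    """Closed-form forward sample: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(abar[t])
    s = math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

def toy_eps_model(x_t, t, cond):
    """Hypothetical stand-in for eps_theta(x_t, t, [c_let, c_imp]).

    It ignores its inputs and predicts zero noise; a real model would be
    the conditional U-Net described in the text.
    """
    return [0.0 for _ in x_t]

rng = random.Random(0)
betas = linear_betas(1000)
abar = alpha_bars(betas)

x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]   # flattened toy "image"
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]     # target noise
t = rng.randrange(len(betas))                      # random timestep

x_t = q_sample(x0, t, abar, eps)
loss = noise_prediction_loss(eps, toy_eps_model(x_t, t, cond=None))
```

In training, `loss` would be averaged over many draws of $x_0$, $\epsilon$, and $t$ and minimized with respect to the model parameters; the zero predictor here only shows how the terms fit together.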

Abstract

Fonts are integral to creative endeavors, design processes, and artistic productions. The appropriate selection of a font can significantly enhance artwork and endow advertisements with a higher level of expressivity. Despite the availability of numerous diverse font designs online, traditional retrieval-based methods for font selection are increasingly being supplanted by generation-based approaches. These newer methods offer enhanced flexibility, catering to specific user preferences and capturing unique stylistic impressions. However, current impression font techniques based on Generative Adversarial Networks (GANs) necessitate the utilization of multiple auxiliary losses to provide guidance during generation. Furthermore, these methods commonly employ weighted summation for the fusion of impression-related keywords. This leads to generic vectors with the addition of more impression keywords, ultimately lacking in detail generation capacity. In this paper, we introduce a diffusion-based method, termed GRIF-DM, to generate fonts that vividly embody specific impressions, utilizing an input consisting of a single letter and a set of descriptive impression keywords. The core innovation of GRIF-DM lies in the development of dual cross-attention modules, which process the characteristics of the letters and impression keywords independently but synergistically, ensuring effective integration of both types of information. Our experimental results, conducted on the MyFonts dataset, affirm that this method is capable of producing realistic, vibrant, and high-fidelity fonts that are closely aligned with user specifications. This confirms the potential of our approach to revolutionize font generation by accommodating a broad spectrum of user-driven design requirements. Our code is publicly available at https://github.com/leitro/GRIF-DM.
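The dual cross-attention idea can be sketched in a few lines. The sketch below is a simplified illustration, not the authors' implementation: it uses a single attention head with identity projections (no learned weights), adds each attention output residually, and applies the impression attention before the letter attention. All three choices, and the function names, are assumptions. What it does show is that image features attend over each keyword token separately, so a keyword set of any length is handled without first collapsing it into one summed vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product attention with identity projections.

    queries: list of feature vectors (e.g. U-Net positions).
    context: list of token embeddings, used as both keys and values.
    """
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(d)])
    return out

def dual_cross_attention(features, c_imp, c_let):
    """Impression attention, then letter attention, each added residually.

    The ordering and the residual connections are assumptions.
    """
    imp = cross_attention(features, c_imp)
    h = [[f + a for f, a in zip(fr, ar)] for fr, ar in zip(features, imp)]
    let = cross_attention(h, c_let)
    return [[x + a for x, a in zip(hr, ar)] for hr, ar in zip(h, let)]

# Toy inputs: 3 feature positions of width 4, one letter token,
# and impression sets of two different lengths.
features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
c_let = [[1.0, 0.0, 0.0, 0.0]]
c_imp_short = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
c_imp_long = c_imp_short + [[0.0, 0.0, 0.0, 1.0],
                            [0.5, 0.5, 0.0, 0.0],
                            [0.0, 0.5, 0.5, 0.0]]

out_short = dual_cross_attention(features, c_imp_short, c_let)
out_long = dual_cross_attention(features, c_imp_long, c_let)
```

Both calls return features with the same shape as the input, regardless of how many impression tokens are supplied.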

Paper Structure

This paper contains 18 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the font generation problem setup. Our method, GRIF-DM, generates the desired fonts from user-provided impression keywords and letters.
  • Figure 2: Architecture of our proposed method, which comprises Encoder, Bottleneck, and Decoder modules, shown in purple, grey, and green respectively. Two frozen-weight BERT modules produce the embeddings of the impression keywords and of the letter, which are integrated via dual cross-attention modules: impression cross-attention in blue and letter cross-attention in red.
  • Figure 3: Qualitative results for font diversity. The top row displays three font names from the test set. Groundtruth images of "L," "E," "A," and "F" are enclosed in blue boxes, while generated font images are in orange dashed boxes, with each row starting from different random noise.
  • Figure 4: Exploration of impression keywords. In the first row, groundtruth font images of the letters "H", "E", "R", and "O" are enclosed in blue boxes, with the corresponding impression keywords to the left. The font images in the subsequent rows are generated by GRIF-DM conditioned on the impression keywords to their left, with modifications highlighted in red.
  • Figure 5: Font image generation with specified impression labels. Following the experimental setup of matsuda2022font, we use the letters "A", "B", "C", "H", "E", "R", "O", "N", and "S" because together they cover the majority of strokes in the Latin capital alphabet.