Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts

Yugo Kubota, Daichi Haraguchi, Seiichi Uchida

TL;DR

Impression-CLIP addresses the challenge of weak, subjective correlations between font shapes and reader impressions by adapting CLIP-style contrastive learning to co-embed font images and impression tags in a shared latent space. The method uses a pre-trained font-shape autoencoder and a pre-trained CLIP text encoder, with learned MLPs to produce unit-normalized embeddings that are trained via a symmetric cross-entropy loss to pull matching font-impression pairs together while pushing non-matching pairs apart. Empirical results on a large MyFonts dataset show improved cross-modal retrieval (ARR and mAP) over the state-of-the-art Cross-AE, and qualitative analyses confirm more coherent and robust retrieval under noisy or missing tags. The work demonstrates that CLIP-inspired contrastive learning can reveal stable cross-modal relationships between typography and impression, enabling practical tasks like impression-based font retrieval and impression estimation, while highlighting limitations due to tag imbalance and noise.

Abstract

Fonts convey different impressions to readers. These impressions often come from the font shapes. However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective. To capture this weak and unstable cross-modal correlation between font shapes and their impressions, we propose Impression-CLIP, a novel machine-learning model based on CLIP (Contrastive Language-Image Pre-training). With the CLIP-based model, font image features and their impression features are pulled closer together, while font image features and unrelated impression features are pushed apart. This procedure realizes a co-embedding of font images and their impressions. In our experiments, we perform cross-modal retrieval between fonts and impressions through the co-embedding. The results indicate that Impression-CLIP achieves better retrieval accuracy than the state-of-the-art method. Additionally, our model shows robustness to noise and missing tags.
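
The pull-together, push-apart objective summarized in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration of a CLIP-style symmetric cross-entropy loss, not the authors' implementation: the function names, the temperature value of 0.07, and the random stand-in embeddings are assumptions, and the pre-trained encoders and learned MLPs are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(font_emb, impression_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over a batch of paired embeddings.

    Row i of font_emb and row i of impression_emb are assumed to be a matching
    pair; every other row in the batch acts as a non-matching example.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    f = l2_normalize(font_emb)
    t = l2_normalize(impression_emb)
    logits = f @ t.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])            # matching pairs lie on the diagonal
    loss_font_to_imp = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_imp_to_font = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_font_to_imp + loss_imp_to_font)


# Random stand-ins for the outputs of the two embedding branches.
rng = np.random.default_rng(0)
fonts = rng.normal(size=(8, 16))
aligned_loss = symmetric_contrastive_loss(fonts, fonts)
misaligned_loss = symmetric_contrastive_loss(fonts, np.roll(fonts, 1, axis=0))
print(aligned_loss < misaligned_loss)
```

Averaging the two directions (font to impression and impression to font) is what makes the loss symmetric: each modality must identify its partner among all candidates in the batch.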

Paper Structure (20 sections, 1 equation, 9 figures, 1 table)

Figures (9)

  • Figure 1: Font styles and their impression tags.
  • Figure 2: Overview of Impression-CLIP, contrastive shape-impression embedding.
  • Figure 3: Visualization of feature distributions before and after contrastive learning by Impression-CLIP.
  • Figure 6: Font image retrieval results for a set of impression tags. The top three retrieved fonts are listed from top to bottom in rank order. Note that the letters "ABC" and "HERONS" are chosen to show various local shape variations, such as curves and corners.
  • Figure 7: Impression retrieval (i.e., estimation) results for a query font image. The top three retrieved impression sets are listed from top to bottom in rank order.
  • ...and 4 more figures