Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
Yugo Kubota, Daichi Haraguchi, Seiichi Uchida
TL;DR
Impression-CLIP addresses the challenge of weak, subjective correlations between font shapes and reader impressions by adapting CLIP-style contrastive learning to co-embed font images and impression tags in a shared latent space. The method uses a pre-trained font-shape autoencoder and a pre-trained CLIP text encoder, with learned MLPs to produce unit-normalized embeddings that are trained via a symmetric cross-entropy loss to pull matching font-impression pairs together while pushing non-matching pairs apart. Empirical results on a large MyFonts dataset show improved cross-modal retrieval (ARR and mAP) over the state-of-the-art Cross-AE, and qualitative analyses confirm more coherent and robust retrieval under noisy or missing tags. The work demonstrates that CLIP-inspired contrastive learning can reveal stable cross-modal relationships between typography and impression, enabling practical tasks like impression-based font retrieval and impression estimation, while highlighting limitations due to tag imbalance and noise.
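The symmetric cross-entropy objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the batch layout (row i of each matrix is assumed to be a matching font-impression pair), and the temperature of 0.07 are all assumptions made for illustration, and the inputs stand in for the outputs of the learned MLPs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project each row onto the unit sphere.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(font_emb, imp_emb, temperature=0.07):
    # font_emb, imp_emb: (batch, dim) arrays; row i of each is assumed
    # to be a matching pair. The temperature value is an assumption.
    f = l2_normalize(font_emb)
    t = l2_normalize(imp_emb)
    logits = f @ t.T / temperature          # scaled cosine similarities
    idx = np.arange(f.shape[0])
    # Font -> impression: each row should select its own column.
    loss_f2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Impression -> font: each column should select its own row.
    loss_t2f = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_f2t + loss_t2f)
```

Minimizing this quantity raises the similarity of the diagonal (matching) pairs relative to every off-diagonal (non-matching) pair in the batch, which is the pull-together, push-apart behavior the summary describes.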
Abstract
Fonts convey different impressions to readers, and these impressions often come from the font shapes. However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective. To capture this weak and unstable cross-modal correlation between font shapes and their impressions, we propose Impression-CLIP, a novel machine-learning model based on CLIP (Contrastive Language-Image Pre-training). With the CLIP-based model, the features of a font image and the features of its impressions are pulled closer together, while font image features and unrelated impression features are pushed apart. This procedure realizes a co-embedding of font images and their impressions. In our experiments, we perform cross-modal retrieval between fonts and impressions through this co-embedding. The results indicate that Impression-CLIP achieves better retrieval accuracy than the state-of-the-art method. Additionally, our model shows robustness to noise and missing tags.
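Once fonts and impressions share one embedding space, impression-based font retrieval reduces to a nearest-neighbor ranking. The sketch below assumes cosine similarity as the ranking score and uses a hypothetical helper `rank_fonts` with toy two-dimensional vectors; none of these names or values come from the paper.

```python
import numpy as np

def rank_fonts(query_emb, font_embs):
    # Rank fonts by cosine similarity to one impression embedding.
    # query_emb: (dim,) array; font_embs: (num_fonts, dim) array.
    q = query_emb / np.linalg.norm(query_emb)
    f = font_embs / np.linalg.norm(font_embs, axis=1, keepdims=True)
    sims = f @ q
    order = np.argsort(-sims)   # most similar font first
    return order, sims[order]

# Toy embeddings, purely illustrative.
font_embs = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.7, 0.7]])
query = np.array([0.9, 0.1])
order, sims = rank_fonts(query, font_embs)
```

The same function covers the reverse direction, impression estimation for a font, by passing a font embedding as the query and the impression embeddings as the candidates.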
