Learning to Kern: Set-wise Estimation of Optimal Letter Space
Kei Nakatsuru, Seiichi Uchida
TL;DR
The paper addresses automatic kerning for Latin letters, where $52^2=2704$ spaces must be determined per font. It introduces a pairwise DNN and a set-wise Transformer to estimate all letter-pair spaces, with the set-wise approach using self-attention to make kerning more consistent across all letters. On a dataset of about 2500 Google Fonts, the set-wise model achieves an average MAE of roughly $5.3$ pixels when the mean space is around $115$ pixels, outperforming FontForge and the pairwise baseline in most cases. This work demonstrates a practical, holistic approach to kerning that can accelerate font design and suggests avenues for designer-guided adjustments and vector-font integration.
Abstract
Kerning is the task of setting appropriate horizontal spaces for all possible letter pairs of a given font. One difficulty of kerning is that the appropriate space differs for each letter pair. Therefore, for a total of 52 capital and small letters, we need to adjust $52 \times 52 = 2704$ different spaces. Another difficulty is that there is neither a general procedure nor a general criterion for automatic kerning; therefore, kerning is still done manually or with heuristics. In this paper, we tackle kerning by proposing two machine-learning models, called the pairwise and set-wise models. The former is a simple deep neural network that estimates the letter space for two given letter images. In contrast, the latter is a transformer-based model that estimates the letter spaces for three or more given letter images. For example, the set-wise model simultaneously estimates all 2704 spaces from the 52 letter images of a given font. Of the two models, the set-wise model is not only more efficient but also more accurate, because its internal self-attention mechanism allows for more consistent kerning across all letters. Experimental results on about 2500 Google Fonts, together with quantitative and qualitative analyses, show that the set-wise model has an average estimation error of only about 5.3 pixels when the average letter space over all fonts and letter pairs is about 115 pixels.
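The pair count and the error scale quoted above can be checked with a few lines of Python. This is only an illustrative sketch, not code from the paper: it assumes that the 2704 spaces correspond to ordered pairs of the 52 Latin letters (so "AV" and "VA" count separately and same-letter pairs are included), and the variable names are hypothetical.

```python
from itertools import product
import string

# 26 capital + 26 small Latin letters, as stated in the abstract.
letters = string.ascii_uppercase + string.ascii_lowercase

# Assumption: each space belongs to an ordered (left, right) pair,
# including same-letter pairs such as ("A", "A").
pairs = list(product(letters, repeat=2))
num_pairs = len(pairs)

# Figures reported in the abstract (both approximate).
mae_px = 5.3
mean_space_px = 115.0

# Reported error expressed as a fraction of the mean letter space.
relative_error = mae_px / mean_space_px

print(num_pairs)                 # 2704
print(f"{relative_error:.1%}")   # 4.6%
```

Under these assumptions, the reported error is a little under 5% of the average letter space.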
