Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification

Daniel Chen, Zaria Zinn, Marcus Lowe

TL;DR

We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations (randomized colors, alignment, line wrapping, and Gaussian noise), producing training images that generalize to real-world typographic samples.

Abstract

We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
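To make the "fewer than 1%" figure concrete, the following is a minimal sketch of how the trainable-parameter count of a LoRA setup can be estimated. Only the 394 classes and the 87.2M total come from the abstract. The hidden size (768), layer count (12), rank (8), the choice of two adapted projection matrices per layer, and the single linear head are illustrative assumptions, not the configuration reported by the authors.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter.

    LoRA replaces a dense update of shape (d_out x d_in) with the product
    B @ A, where A is (rank x d_in) and B is (d_out x rank).
    """
    return rank * (d_in + d_out)


# Values taken from the abstract.
NUM_CLASSES = 394
TOTAL_PARAMS = 87_200_000

# Assumed values (hypothetical, chosen only for illustration).
HIDDEN = 768            # assumed transformer width
LAYERS = 12             # assumed number of transformer blocks
RANK = 8                # assumed LoRA rank
ADAPTED_PER_LAYER = 2   # assumed: two projection matrices adapted per block

# Low-rank adapters across all blocks.
adapter_params = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)

# Assumed linear classification head (weights plus biases).
head_params = HIDDEN * NUM_CLASSES + NUM_CLASSES

trainable = adapter_params + head_params
fraction = trainable / TOTAL_PARAMS

# For comparison: a dense update of a single square projection matrix.
dense_update_params = HIDDEN * HIDDEN

print(f"adapter parameters:   {adapter_params:,}")
print(f"head parameters:      {head_params:,}")
print(f"trainable parameters: {trainable:,} ({fraction:.2%} of total)")
print(f"one dense update:     {dense_update_params:,}")
```

Under these assumptions each adapter adds 12,288 parameters, against 589,824 for a dense update of the same matrix, which is why the trainable share stays well below 1% even with the classification head included.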

Paper Structure (29 sections, 4 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Sample test images from four font families spanning different typographic categories: serif (Crimson Pro), sans-serif (Inter), monospace (JetBrains Mono), and display (Big Shoulders Text). Each column shows a different weight variant. These are the actual rendered images seen by the model at inference time, including color augmentation and noise.
  • Figure 2: Row-normalized confusion matrix across all font classes, grouped by font family. The strong diagonal indicates high per-class accuracy, with most off-diagonal mass concentrated within family blocks (weight variant confusion).
  • Figure 3: Top-20 most frequent misclassification pairs. The majority involve adjacent weight variants within the same font family, confirming that inter-family classification is near-perfect.
  • Figure 4: t-SNE visualization of [CLS] token embeddings from the final hidden layer, colored by font family. Points belonging to the same family cluster tightly together, indicating that the model learns a representation space that groups typographically related variants. Distinct font categories (serif, sans-serif, monospace) occupy well-separated regions.
  • Figure 5: Classification accuracy broken down by font family, sorted from lowest to highest. Most families achieve near-perfect accuracy; lower-performing families tend to have many visually similar weight variants.
  • ...and 2 more figures
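The analyses behind Figures 2 and 3 (a row-normalized confusion matrix and a ranking of the most frequent misclassification pairs) can be sketched as follows. This is a minimal illustration on toy data, not the authors' evaluation code. The label strings and the helper names `confusion_counts`, `row_normalize`, and `top_confusions` are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def row_normalize(counts, labels):
    """Divide each row by its total so every true class sums to 1."""
    matrix = {}
    for true_label in labels:
        row_total = sum(counts.get((true_label, p), 0) for p in labels)
        matrix[true_label] = {
            p: (counts.get((true_label, p), 0) / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix


def top_confusions(counts, k):
    """Return the k most frequent off-diagonal (true, predicted) pairs."""
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    # Sort by descending count, then by label pair for a stable order.
    errors.sort(key=lambda item: (-item[1], item[0]))
    return errors[:k]


# Toy predictions (hypothetical class names built from two families).
labels = ["Inter-Regular", "Inter-Medium", "CrimsonPro-Regular"]
y_true = ["Inter-Regular"] * 4 + ["Inter-Medium"] * 4 + ["CrimsonPro-Regular"] * 2
y_pred = [
    "Inter-Regular", "Inter-Regular", "Inter-Regular", "Inter-Medium",
    "Inter-Medium", "Inter-Medium", "Inter-Regular", "Inter-Regular",
    "CrimsonPro-Regular", "CrimsonPro-Regular",
]

counts = confusion_counts(y_true, y_pred)
matrix = row_normalize(counts, labels)
worst_pairs = top_confusions(counts, k=2)

for (true_label, predicted), n in worst_pairs:
    print(f"{true_label} -> {predicted}: {n}")
```

In this toy example every error falls between two weights of the same family, which is the pattern the captions describe: off-diagonal mass concentrated inside family blocks rather than across families.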