VecGlypher: Unified Vector Glyph Generation with Language Models

Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu, Weiming Ren, Zhaochong An, Zijian Zhou, Haonan Qiu, Yuyin Zhou, Sen He, Ziheng Wang, Tao Xiang, Xiao Han

TL;DR

On cross-family out-of-distribution (OOD) evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches state-of-the-art performance.

Abstract

Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry. Preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
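
The preprocessing steps named in the abstract (normalizing coordinate frames, quantizing coordinates, and serializing with absolute coordinates) can be illustrated with a minimal Python sketch. The abstract does not state the grid resolution, the source units, or the exact token format, so `GRID`, `units_per_em`, and the `(letter, coords)` command layout below are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical command layout: (command letter, absolute coordinates in font units).
Command = Tuple[str, List[float]]

GRID = 200  # assumed quantization resolution; the abstract does not give one


def normalize_and_quantize(
    cmds: List[Command], units_per_em: float, grid: int = GRID
) -> List[Tuple[str, List[int]]]:
    """Rescale font-unit coordinates to a shared grid and round to integers."""
    out = []
    for letter, coords in cmds:
        # Multiply before dividing so simple cases stay exact in floating point.
        out.append((letter, [round(c * grid / units_per_em) for c in coords]))
    return out


def serialize_absolute(cmds: List[Tuple[str, List[int]]]) -> str:
    """Join commands into a whitespace-separated, absolute-coordinate path string."""
    parts = []
    for letter, coords in cmds:
        parts.append(" ".join([letter] + [str(c) for c in coords]))
    return " ".join(parts)


# A triangle outline in a 1000-unit em square.
triangle = [("M", [100, 0]), ("L", [900, 0]), ("L", [500, 700]), ("Z", [])]
path_d = serialize_absolute(normalize_and_quantize(triangle, units_per_em=1000))
```

Mapping every font onto the same integer grid keeps the coordinate vocabulary small and shared across families, which is one plausible reading of why quantization helps long-sequence decoding.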

Paper Structure

This paper contains 42 sections, 1 equation, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: VecGlypher generates high-fidelity vector glyphs directly as editable SVG outlines under two types of conditioning: (a) image-referenced generation, where a handful of exemplar glyph images specify the style and the model synthesizes new glyphs in the same visual form; and (b) text-referenced generation, where a natural-language prompt drives the synthesis without requiring exemplars. The figure shows the synthesized wordmark and sample vector outlines, highlighting one-pass generation of clean, controllable contours for typography workflows.
  • Figure 2: Paradigm comparisons. a) Prior image-referenced pipelines use separate image and vector encoder–decoders and a geometry post-optimizer. b) Diffusion-based approaches cascade image diffusion with a vector decoder. c) VecGlypher unifies both text- and image-referenced conditioning within a single LLM: given a style description or reference glyph images plus a target character, the model autoregressively emits SVG path tokens that detokenize to a valid SVG path. This formulation removes raster intermediates and exemplar-sheet requirements while producing directly editable vectors. A practical workflow is to first generate a few reference glyphs from text descriptions, then bootstrap with those images to synthesize the full font.
  • Figure 3: VecGlypher pipeline and training recipe. a) A text tokenizer or an image encoder conditions the LLM ($||$ denotes a mutually exclusive choice), which predicts the next SVG token until the path is complete; detokenization yields SVG paths that we rasterize for display only. b) Training is two-stage: 1) SFT on Envato (text-referenced only) teaches SVG syntax and long-horizon geometry; 2) SFT on Google Fonts (text- or image-referenced) aligns geometry to appearance instructions. No raster denoisers or post-optimizers are used.
  • Figure 4: Text-referenced ablations. Representative text-to-glyph generations across model sizes and data regimens: ground truth (GT), Google-only 4B and 27B models, Envato-only 27B, mixed E+G 27B, and two-stage E$\to$G 27B. Using the same style tags across columns, scaling and the two-stage recipe yield cleaner closures, stable counters, and more faithful style. Please refer to the supplementary materials for comprehensive results.
  • Figure 5: Image-referenced ablations. Given 1 to 8 reference glyphs from a font, we compare Google-only 4B/27B with two-stage variants (E$\to$G I and E$\to$G T,I). The two-stage settings at 27B best transfer style and preserve thin structures and closures; Google-only baselines underperform in geometry. Please refer to the supplementary materials for comprehensive results.
  • ...and 3 more figures
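
The captions for Figures 2 and 3 describe detokenizing the emitted tokens into a valid SVG path. A minimal Python sketch of what such a validity check might look like is given below. It assumes the decoded text is a whitespace-separated path using only the absolute commands M, L, Q, C, and Z with integer coordinates and no implicit argument repetition; that format, and the helper names `parse_path` and `is_closed`, are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Number of coordinates each supported absolute SVG path command takes.
ARITY = {"M": 2, "L": 2, "Q": 4, "C": 6, "Z": 0}


def parse_path(d: str) -> List[Tuple[str, List[int]]]:
    """Parse a path string into (letter, coords) pairs; raise ValueError if malformed."""
    tokens = d.split()
    cmds = []
    i = 0
    while i < len(tokens):
        letter = tokens[i]
        if letter not in ARITY:
            raise ValueError(f"unexpected token {letter!r}")
        n = ARITY[letter]
        args = tokens[i + 1 : i + 1 + n]
        if len(args) != n:
            raise ValueError(f"{letter} expects {n} numbers, got {len(args)}")
        # int() raises ValueError if a command letter appears where a number should be.
        cmds.append((letter, [int(a) for a in args]))
        i += 1 + n
    return cmds


def is_closed(cmds: List[Tuple[str, List[int]]]) -> bool:
    """True if the path is non-empty and every subpath starts with M and ends with Z."""
    in_subpath = False
    for letter, _ in cmds:
        if letter == "M":
            if in_subpath:
                return False  # previous subpath was never closed
            in_subpath = True
        elif not in_subpath:
            return False  # drawing or closing outside a subpath
        elif letter == "Z":
            in_subpath = False
    return bool(cmds) and not in_subpath


parsed = parse_path("M 20 0 L 180 0 L 100 140 Z")
```

This check is stricter than the SVG specification, which allows drawing to continue after `Z` without a new `M`; requiring an explicit `M ... Z` per contour is a simplifying convention for the sketch.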