ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Douglass Wang

TL;DR

This work proposes ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps, and introduces next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x.

Abstract

Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
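
The unit-step decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ink has already been quantized to integer grid points, that the eight unit steps are the 8-connected Freeman directions, and that pen-in-air travel between strokes is encoded with the same direction tokens. The token names (`E`, `NE`, `<down>`, `<up>`) and the helpers `bresenham_steps` and `tokenize` are hypothetical, and the compass labels assume y grows upward.

```python
# Hypothetical base vocabulary: 8 unit directions + 2 pen-state tokens = 10 tokens.
# Token names are illustrative; the paper's actual symbols are not given here.
DIRECTIONS = {
    (1, 0): "E", (1, 1): "NE", (0, 1): "N", (-1, 1): "NW",
    (-1, 0): "W", (-1, -1): "SW", (0, -1): "S", (1, -1): "SE",
}
PEN_DOWN, PEN_UP = "<down>", "<up>"


def bresenham_steps(x0, y0, x1, y1):
    """Return the unit steps (dx, dy) of the Bresenham line between two grid points."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 > x0 else -1
    sy = 1 if y1 > y0 else -1
    err = dx - dy
    x, y = x0, y0
    steps = []
    while (x, y) != (x1, y1):
        e2 = 2 * err
        step_x = step_y = 0
        if e2 > -dy:      # advance along x
            err -= dy
            step_x = sx
        if e2 < dx:       # advance along y
            err += dx
            step_y = sy
        x += step_x
        y += step_y
        steps.append((step_x, step_y))
    return steps


def tokenize(strokes):
    """Encode strokes (lists of integer (x, y) points) as base tokens.

    Assumption: each stroke is wrapped in pen-down / pen-up tokens, and the
    pen-in-air move to the next stroke is emitted between them.
    """
    tokens = []
    prev = None
    for stroke in strokes:
        if prev is not None:
            tokens += [DIRECTIONS[s] for s in bresenham_steps(*prev, *stroke[0])]
        tokens.append(PEN_DOWN)
        for a, b in zip(stroke, stroke[1:]):
            tokens += [DIRECTIONS[s] for s in bresenham_steps(*a, *b)]
        tokens.append(PEN_UP)
        prev = stroke[-1]
    return tokens


ink = [[(0, 0), (3, 1)], [(3, 3), (3, 2)]]
print(tokenize(ink))
```

Because every step moves to an adjacent grid cell, any pair of grid points can be connected using only these ten symbols, which is why no out-of-vocabulary case can arise at the base level.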

Paper Structure

This paper contains 57 sections, 2 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: ScribeTokens representation of a handwritten sentence. Pen strokes are decomposed into unit directional steps via Bresenham's algorithm, then compressed with BPE. Each color denotes a distinct BPE token; faint colors indicate pen-in-air movement between strokes. The zoom shows the sequence of arrows making up an example token.
  • Figure 2: Bresenham decomposition of a line segment between two grid points (start and end). The segment is rasterized into adjacent grid cells via Bresenham's algorithm, then encoded as a sequence of Freeman chain code directions.
  • Figure 3: Average compression ratios ($\uparrow$) of BPE-based digital ink representations on the IAM validation set, across target vocabulary sizes and quantization parameters $\delta$. ScribeTokens consistently achieves the highest compression across nearly all settings.
  • Figure 4: Average out-of-vocabulary (OOV) rates ($\downarrow$) of BPE-based digital ink representations on the IAM validation set, across target vocabulary sizes and quantization parameters $\delta$. ScribeTokens and TextTokens are OOV-free by construction, while AbsTokens and RelTokens exhibit non-zero OOV rates as their coordinate-based vocabularies inevitably encounter unseen values at test time.
  • Figure 5: Effect of quantization parameter $\delta$ on reconstruction quality. Each row shows the same IAM sample quantized at a different $\delta$, displayed both as raw quantized ink (left) and after Savitzky--Golay post-processing (right). Post-processed inks are visually indistinguishable from the original for $\delta \leq 8$; the row at $\delta = 8$ (highlighted in green) maximizes compression without sacrificing fidelity and is used in all downstream experiments.
  • ...and 5 more figures
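
The BPE compression referred to in the captions for Figures 1, 3, and 4 can be illustrated with a toy merge loop. This is a sketch under stated assumptions, not the paper's tokenizer: base tokens are written as single characters (Freeman chain code digits `0` to `7`, plus `D` and `U` as hypothetical pen-down and pen-up symbols), and the compression ratio is assumed to be the number of base tokens divided by the number of tokens after merging, which may differ from the paper's exact definition. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most frequent adjacent token pair across all sequences, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(seqs, num_merges):
    """Greedily learn up to `num_merges` merges; return the merges and encoded data."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs


def compression_ratio(base_seqs, merged_seqs):
    """Assumed definition: base-token count divided by merged-token count."""
    return sum(map(len, base_seqs)) / sum(map(len, merged_seqs))


# Toy corpus: two short strokes written as chain-code digits between pen tokens.
base = [list("D0000U"), list("D0011U")]
merges, encoded = train_bpe(base, num_merges=2)
print(merges)
print(encoded)
print(compression_ratio(base, encoded))
```

Every merged token expands back into base tokens, so an unseen ink can always be encoded by falling back to the ten base symbols. A coordinate-based vocabulary has no such fallback when it meets a coordinate value absent from training.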