The Cursive Transformer

Sam Greydanus, Zachary Wimpee

TL;DR

The paper tackles generating realistic cursive handwriting conditioned on text. It introduces a simple tokenization scheme that converts pen stroke offsets to polar coordinates ($\theta$, $r$), discretizes them into bins, and emits two tokens per stroke, then trains a vanilla GPT model with cross-attention on ASCII input. This approach eliminates the need for mixture density networks or specialized attention heads; a 3,500-sample dataset with simple augmentations is enough to demonstrate high-quality cursive generation. Key contributions include the polar-coordinate tokenizer, a compact 523-token vocabulary, the emergence of ASCII-stroke alignment through cross-attention, and a demonstration that small GPT-based models can match image-based handwriting methods. The work suggests a generalizable strategy for niche, continuous-data modalities, with potential extensions to robotics and 3D motion through analogous tokenization.

Abstract

Transformers trained on tokenized text, audio, and images can generate high-quality autoregressive samples. But handwriting data, represented as sequences of pen coordinates, remains underexplored. We introduce a novel tokenization scheme that converts pen stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens with which to train a standard GPT model. This allows us to capture complex stroke distributions without using any specialized architectures (e.g., the mixture density network or the self-advancing ASCII attention head from Graves 2014). With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting. Our approach is simpler and more performant than previous RNN-based methods.
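The tokenization described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bin counts (`N_THETA_BINS`, `N_R_BINS`), the clipping radius `R_MAX`, the uniform bin spacing, and the function name `tokenize_offsets` are all assumptions chosen for illustration, and pen-up/pen-down state is ignored.

```python
import numpy as np

# Hypothetical settings for illustration only; the paper's actual bin
# counts and bin spacing are not given in this summary.
N_THETA_BINS = 8
N_R_BINS = 4
R_MAX = 1.0  # offsets longer than this fall into the last r bin


def tokenize_offsets(points):
    """Turn a sequence of (x, y) pen positions into a flat token sequence.

    Each stroke offset yields two tokens: a theta token followed by an
    r token. The r tokens are shifted by N_THETA_BINS so the two kinds
    of token occupy disjoint ranges of one shared vocabulary.
    """
    pts = np.asarray(points, dtype=float)
    d = np.diff(pts, axis=0)                  # stroke offsets (dx, dy)
    theta = np.arctan2(d[:, 1], d[:, 0])      # angle in [-pi, pi]
    r = np.hypot(d[:, 0], d[:, 1])            # offset length

    # Uniform binning of the angle; the modulo folds theta == pi into bin 0.
    theta_bin = np.floor((theta + np.pi) / (2 * np.pi) * N_THETA_BINS)
    theta_bin = theta_bin.astype(int) % N_THETA_BINS

    # Uniform binning of the radius, clipped to the last bin.
    r_bin = np.minimum((r / R_MAX * N_R_BINS).astype(int), N_R_BINS - 1)

    tokens = np.empty(2 * len(d), dtype=int)
    tokens[0::2] = theta_bin                  # even positions: theta tokens
    tokens[1::2] = N_THETA_BINS + r_bin       # odd positions: r tokens
    return tokens


tokens = tokenize_offsets([(0.0, 0.0), (0.2, 0.1), (0.1, 0.8)])
```

Three pen positions give two offsets and therefore four tokens. Interleaving the $\theta$ and $r$ tokens in a single vocabulary is what lets an unmodified GPT model treat the stroke sequence like text.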

Paper Structure

This paper contains 6 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Lines between characters are shaped by their neighbors.
  • Figure 2: The opening lines of Homer's Iliad generated by our model from an ASCII input. The model is a standard GPT architecture which we adapted to the task by developing a novel tokenization scheme for pen stroke data.
  • Figure 3: Overview of the Cursive Transformer pipeline. (a) Collecting handwriting data as pen stroke sequences. (b) Computing stroke offsets in polar coordinates ($\theta$ and $r$). (c) Discretizing $\theta$ and $r$ into bins. (d) Tokenizing discrete variables for GPT-2 training. (e) Training the model to generate cursive from ASCII input.
  • Figure 4: Exploring cross-attention patterns (top row) and self-attention patterns (bottom row). The cross-attention patterns show that in early layers (layer 2) the model does not use ASCII information. In layer 3 it begins to attend to ASCII characters: both the current ASCII token and its neighbors immediately before and after. Layers 4 and 5 show considerably tighter attention patterns, with layer 5 focusing almost entirely on the current character token. Note that the model uses more stroke tokens to draw some characters than others (e.g., '?' or 'A' versus the spaces). Self-attention patterns are harder to interpret, but tend to show increasing differentiation and variation as one moves up the layers. See Appendix for plots of all heads and layers.
  • Figure 5: Example of training data collected via the web app and trackpad input. Each word was collected separately; here they have been appended to one another to make a single, 5-word training sequence. Note: our final model uses 4-word training sequences.
  • ...and 3 more figures
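
The cross-attention patterns discussed in Figure 4 come from stroke-token positions (queries) attending over the ASCII characters (keys and values). The following is a minimal single-head sketch of that mechanism in NumPy, using random weights and hypothetical dimensions. It shows only how the attention matrix is formed and is not the authors' model code.

```python
import numpy as np


def cross_attention(stroke_h, ascii_h, Wq, Wk, Wv):
    """Single-head cross-attention: stroke positions attend to ASCII positions.

    stroke_h: (n_stroke_tokens, d) hidden states of the stroke sequence
    ascii_h:  (n_chars, d) embeddings of the ASCII conditioning text
    Returns the attended values and the (n_stroke_tokens, n_chars) weights.
    """
    q = stroke_h @ Wq                      # queries come from stroke tokens
    k = ascii_h @ Wk                       # keys come from ASCII characters
    v = ascii_h @ Wv                       # values come from ASCII characters
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over chars
    return weights @ v, weights


# Hypothetical sizes: 10 stroke tokens, 3 ASCII characters, width 16.
rng = np.random.default_rng(0)
d = 16
stroke_h = rng.normal(size=(10, d))
ascii_h = rng.normal(size=(3, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out, attn = cross_attention(stroke_h, ascii_h, Wq, Wk, Wv)
```

Each row of `attn` is a distribution over ASCII characters for one stroke token. A plot of such a matrix for a trained model is what the top row of Figure 4 visualizes; with the random weights used here the rows carry no meaningful alignment.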