DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin

TL;DR

DiffInk tackles the challenge of text-to-online handwriting generation for full lines by learning a semantically structured latent space and performing conditional latent diffusion. It introduces InkVAE, which uses OCR-based and style-classification regularizations to disentangle content from writer style, and InkDiT, a latent diffusion Transformer conditioned on target text and reference style to produce coherent handwriting trajectories. The approach yields state-of-the-art content fidelity, style consistency, and efficiency on CASIA Chinese handwriting data, with strong qualitative coherence and layout integration. The framework also demonstrates potential for multilingual extension, data augmentation for OCR, and personalized handwriting applications, all while significantly reducing computational cost compared to prior character- or layout-decoupled methods.

Abstract

Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.
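
As a rough illustration of how the two regularizers might enter InkVAE training, the objective can be sketched as a weighted sum. This is an assumed form, not the paper's stated equation: the reconstruction term $\mathcal{L}_{\text{rec}}$, the KL term $\mathcal{L}_{\text{KL}}$, and the weights $\beta$, $\lambda_{\text{ocr}}$, $\lambda_{\text{sty}}$ are hypothetical placeholders, and only the OCR-based and style-classification losses are named in the source.

$$
\mathcal{L}_{\text{InkVAE}} \;=\; \mathcal{L}_{\text{rec}} \;+\; \beta\,\mathcal{L}_{\text{KL}} \;+\; \lambda_{\text{ocr}}\,\mathcal{L}_{\text{ocr}} \;+\; \lambda_{\text{sty}}\,\mathcal{L}_{\text{sty}}
$$

Under this reading, $\mathcal{L}_{\text{ocr}}$ pushes the latent sequence to remain recognizable as the target glyphs, while $\mathcal{L}_{\text{sty}}$ pushes it to remain attributable to the correct writer.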

Paper Structure

This paper contains 62 sections, 9 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Overview of DiffInk. By directly modeling entire text lines rather than individual characters, the model efficiently synthesizes online handwritten text lines ($G_i$) conditioned on textual input ($T$) and style references ($S_i$), achieving accurate content reproduction and consistent style in both character form and layout structure. Different colors represent distinct handwriting styles.
  • Figure 2: Latent-space visualization of Vanilla VAE vs. InkVAE (ours). While both models achieve good reconstruction, InkVAE learns a more structured latent space (visualized with t-SNE; van der Maaten and Hinton, 2008): (a) Text-line features from 8 writers: InkVAE exhibits clearer writer-specific clusters. (b) Character-level features from 8 common characters: InkVAE yields tighter intra-class groupings and more distinct inter-class separation.
  • Figure 3: Overview of the DiffInk Framework. (a) InkVAE encodes online handwriting sequences into compact latent representations. During training, regularization losses $\mathcal{L}_{\text{ocr}}$ and $\mathcal{L}_{\text{sty}}$ are applied to the latent space to encourage disentangled glyph and style. (b) InkDiT leverages this latent space to synthesize handwriting by denoising noisy inputs $x_t$ into clean representations $x_0$. The process is conditioned on content features $Z$ obtained from text embeddings and style features $x_\text{ref}$ derived from a reference trajectory. InkDiT is trained with a diffusion loss $\mathcal{L}_{\text{diff}}$.
  • Figure 4: Comparison with SOTA methods under unseen writing styles across diverse layouts. All baseline methods generate isolated characters and compose lines via a shared layout module. Blue boxes denote the same style reference, while red boxes highlight errors or unnatural character connections. These methods suffer from stitching artifacts, especially when adjacent characters differ structurally. DiffInk generates more coherent and naturally connected text lines.
  • Figure 5: InkDiT Generation with VAE Variants. Blue boxes highlight content errors; red boxes indicate style inconsistencies. InkDiT trained on the latent space from our InkVAE yields more accurate and consistent results.
  • ...and 14 more figures
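
The Figure 3 caption describes InkDiT as denoising a noisy latent $x_t$ into a clean latent $x_0$, conditioned on content features $Z$ and a style reference $x_\text{ref}$. The following is a minimal sketch of what such a training step could look like, not the authors' implementation. Several pieces are assumptions made for illustration: the cosine noise schedule, the mean-squared-error form of $\mathcal{L}_{\text{diff}}$, the flattening of the latent into a plain list of floats, and the hypothetical names `cosine_alpha_bar`, `add_noise`, `diffusion_loss`, and `denoiser`.

```python
import math
import random


def cosine_alpha_bar(t: float) -> float:
    """Cumulative signal level at normalized time t in [0, 1].

    Assumption: a cosine schedule; the source does not specify one.
    """
    return math.cos(0.5 * math.pi * t) ** 2


def add_noise(x0, t, rng):
    """Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = cosine_alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def diffusion_loss(denoiser, x0, content, style_ref, t, rng):
    """Assumed L_diff: mean squared error between predicted and true x_0.

    `denoiser` stands in for InkDiT and takes (x_t, t, content, style_ref),
    mirroring the conditioning on Z and x_ref named in the caption.
    """
    x_t = add_noise(x0, t, rng)
    x0_hat = denoiser(x_t, t, content, style_ref)
    return sum((p - q) ** 2 for p, q in zip(x0_hat, x0)) / len(x0)


if __name__ == "__main__":
    rng = random.Random(0)
    clean_latent = [0.5, -1.0, 2.0]  # toy stand-in for an InkVAE latent

    # A placeholder "model" that returns its noisy input unchanged.
    def identity(x_t, t, content, style_ref):
        return x_t

    for t in (0.0, 0.5, 1.0):
        loss = diffusion_loss(identity, clean_latent, None, None, t, rng)
        print(f"t={t:.1f}  loss={loss:.4f}")
```

With the identity placeholder, the loss is zero at `t = 0` (no noise has been added) and grows as `t` approaches 1, which is the gap a trained denoiser would be expected to close using the text and style conditions.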