Zero-Shot Paragraph-level Handwriting Imitation with Latent Diffusion Models

Martin Mayr, Marcel Dreier, Florian Kordon, Mathias Seuret, Jochen Zöllner, Fei Wu, Andreas Maier, Vincent Christlein

TL;DR

The paper introduces a modified latent diffusion model whose encoder-decoder mechanism is enriched with specialized loss functions that explicitly preserve the style and content of handwriting, significantly improving the realism of the generated handwriting.

Abstract

The imitation of cursive handwriting is mainly limited to generating handwritten words or lines. Multiple synthetic outputs must be stitched together to create paragraphs or whole pages, whereby consistency and layout information are lost. To close this gap, we propose a method for imitating handwriting at the paragraph level that also works for unseen writing styles. To this end, we introduce a modified latent diffusion model that enriches the encoder-decoder mechanism with specialized loss functions that explicitly preserve the style and content. We also extend the attention mechanism of the diffusion model with adaptive 2D positional encoding and adapt the conditioning mechanism to work with two modalities simultaneously: a style image and the target text. This significantly improves the realism of the generated handwriting. Our approach sets a new benchmark in our comprehensive evaluation. It outperforms all existing imitation methods at both line and paragraph levels when style and content preservation are considered jointly.
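
The abstract does not spell out how the adaptive 2D positional encoding is built. As a rough illustration of the underlying idea only, the following sketch constructs a fixed (non-adaptive) sinusoidal 2D encoding in which half of the channels encode the row index and the other half the column index. The function names `sinusoid_1d` and `positional_encoding_2d` and all sizes are assumptions made for this example, and the learned, input-dependent scaling that would make the encoding adaptive is deliberately left out.

```python
import numpy as np

def sinusoid_1d(length, dim):
    """Standard 1D sinusoidal encoding of shape (length, dim); dim must be even."""
    positions = np.arange(length)[:, None]                          # (length, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    angles = positions * freqs[None, :]                             # (length, dim / 2)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def positional_encoding_2d(height, width, dim):
    """Fixed 2D encoding (hypothetical helper, not the paper's adaptive variant).

    The first half of the channels depends only on the row index,
    the second half only on the column index.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    half = dim // 2
    rows = sinusoid_1d(height, half)     # (height, half)
    cols = sinusoid_1d(width, half)      # (width, half)
    enc = np.zeros((height, width, dim))
    enc[:, :, :half] = rows[:, None, :]  # broadcast row code across all columns
    enc[:, :, half:] = cols[None, :, :]  # broadcast column code across all rows
    return enc

pe = positional_encoding_2d(4, 6, 16)
print(pe.shape)  # (4, 6, 16)
```

Such an encoding would typically be added to the latent feature map before attention, so that each position carries both its vertical and horizontal location, which matters for paragraphs because line breaks make the layout genuinely two-dimensional.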

Paper Structure (34 sections, 2 equations, 11 figures, 10 tables)


Figures (11)

  • Figure 1: Compared to existing methods, our approach produces more realistic synthetic handwritten paragraphs in a specific style.
  • Figure 2: Method overview. We transfer the handwritten paragraphs in and out of latent space via encoder $\mathcal{E}$ and decoder $\mathcal{D}$. The denoising U-Net $\epsilon_{\Theta}$ is trained in latent space and conditioned with cross-attention. As conditioning information, we have two inputs: (1) a style image $x_{\text{style}}$, which we encode with a shallow CNN $\mathcal{E}_{\text{style}}$, and (2) a target text $x_{\text{text}}$, which we embed into feature space. We fuse both modalities with a transformer and forward them as a stylized embedding into the denoising U-Net via cross-attention.
  • Figure 3: Comparison of text generation and style imitation performance based on a style sample (top) and the target text of a genuine sample (bottom). Images were sampled at random and cropped after the first three lines.
  • Figure 4: UMAP visualization of the five most frequent writers in the IAM test set, colour-coded in the plot. It shows that our generated samples ($\times$) are much closer to the genuine samples ($\bullet$) than those generated by the other methods ($\blacksquare$, $\blacktriangle$, $\blacklozenge$).
  • Figure 5: Qualitative comparison of the paragraph reconstructions, showing that the additional HTR and WI losses are beneficial.
  • ...and 6 more figures
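
The conditioning path described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration under strong assumptions, not the authors' implementation: random arrays stand in for the CNN style features, the text embedding, and the noisy latent; a single self-attention pass stands in for the fusion transformer; and learned projection weights are omitted. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

d = 32                                    # assumed shared feature width
style_tokens = rng.normal(size=(10, d))   # stand-in for CNN features of the style image
text_tokens = rng.normal(size=(20, d))    # stand-in for the embedded target text
latent_tokens = rng.normal(size=(48, d))  # stand-in for a flattened noisy latent

# Fusion: one self-attention pass over both modalities
# (a simplified stand-in for the fusion transformer).
cond = np.concatenate([style_tokens, text_tokens], axis=0)
fused, _ = attention(cond, cond, cond)

# Cross-attention: every latent position queries the stylized embedding.
out, weights = attention(latent_tokens, fused, fused)
print(out.shape, weights.shape)  # (48, 32) (48, 30)
```

Each row of `weights` sums to one, so every latent position receives a convex combination of the fused style and text tokens. In a real model this step would sit inside the denoising U-Net blocks and use learned query, key, and value projections.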