Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation

Kai Brandenbusch

TL;DR

This work presents a text-and-style conditional latent diffusion framework for handwritten text generation (HTG) that can synthesize word images in the handwriting style of unseen writers. It leverages a masked autoencoder (MAE) to produce writer-style embeddings and extends the content encoder to fuse textual and stylistic conditioning, enabling multiple conditioning strategies within a latent diffusion model (LDM). A semi-supervised training scheme is proposed to adapt the model to new, unlabeled datasets, improving generation quality when transcriptions are unavailable. Evaluations on IAM and RIMES show that classifier-free guidance and style-conditioning via the TA/TS approaches enhance content fidelity and style realism, with semi-supervised training yielding notable gains in unseen-domain generation, though gaps remain relative to fully supervised baselines. Overall, the method advances HTG by enabling unseen-writer synthesis and practical adaptation to new datasets for generating training data for handwriting recognition models.

Abstract

The generation of realistic-looking, readable images of handwritten text is a challenging task referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models to new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG that enables the generation of writing styles not seen during training by learning style conditioning with a masked autoencoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore its influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM database and use the RIMES database to examine the generation of data not seen during training, achieving improvements in this particularly promising application of DMs for HTG.

Paper Structure

This paper contains 36 sections, 29 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed system during training given an input image $\boldsymbol{x}_{0}$, its transcription $\boldsymbol{y}$ and a set $\mathcal{X}_{w}^{K}$ of examples from the sample writer. The system comprises the encoder part $V_{E}$ of the VAE, style encoder $S_{E}$, content encoder $C_{E}$ and the LDM. The LDM $\boldsymbol{\epsilon}_{\Theta}\left( \boldsymbol{z}_{t}, \boldsymbol{t}_{emb}, \mathbf{C} \right)$ is trained to predict the noise added in step $t$ of the forward diffusion process.
  • Figure 2: Overview of the proposed system when sampling an image given the desired string $\boldsymbol{y}$ and a set $\mathcal{X}_{w}^{K}$ of examples from the desired writer. Starting from random noise $\boldsymbol{z}_{T} \sim \mathcal{N}\left(\boldsymbol{0}, \mathbf{I}\right)$, the noise $\boldsymbol{\epsilon}_{\Theta}\left( \boldsymbol{z}_{t}, \boldsymbol{t}_{emb}, \mathbf{C} \right)$ is predicted and removed. This procedure is repeated for $T$ timesteps to compute $\hat{\boldsymbol{z}}_{0}$ which is passed to the decoder part $V_{D}$ of the VAE to obtain the final image $\hat{\boldsymbol{x}}$.
  • Figure 3: Computation of the content embedding $\mathbf{C}=C_{E}\left( \boldsymbol{y}, \boldsymbol{s}_{w} \right)$ and the timestep embedding $\boldsymbol{t}_{emb} = C_{T}\left( t, \boldsymbol{s}_{w} \right)$ for different choices of incorporating the style embedding. PE and LPE denote positional encodings and learned positional embeddings, respectively. Operators are $+$ for elementwise addition, $\bigoplus$ for concatenation of two vectors and $\Vert$ for appending a vector to the sequence.
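
The three operators named in the Figure 3 caption differ mainly in how they change the shape of the embedding. The sketch below is only an illustration of that difference on toy data. The function names, the embedding width of 4, and the sequence length of 3 are assumptions made for this example, and the sketch does not reproduce which operator the paper applies to which input.

```python
# Illustrative sketch only: names, sizes and values are assumptions,
# not taken from the paper.

def add(u, v):
    """Elementwise addition (+): both vectors must have the same length."""
    if len(u) != len(v):
        raise ValueError("elementwise addition needs equal-length vectors")
    return [a + b for a, b in zip(u, v)]


def concat(u, v):
    """Concatenation of two vectors: the feature width grows."""
    return list(u) + list(v)


def append_to_sequence(seq, v):
    """Append a vector to a sequence: the sequence length grows by one."""
    if seq and len(seq[0]) != len(v):
        raise ValueError("appended vector must match the token width")
    return [list(tok) for tok in seq] + [list(v)]


# Hypothetical inputs: three character embeddings of width 4
# and one style embedding of the same width.
text_seq = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
style = [0.5, 0.5, 0.5, 0.5]

# Addition keeps the shape: 3 tokens of width 4.
added = [add(tok, style) for tok in text_seq]

# Concatenation widens every token: 3 tokens of width 8.
concatenated = [concat(tok, style) for tok in text_seq]

# Appending lengthens the sequence: 4 tokens of width 4.
appended = append_to_sequence(text_seq, style)

for name, result in [("add", added), ("concat", concatenated), ("append", appended)]:
    print(name, len(result), "x", len(result[0]))
```

Addition and appending both require the style embedding to match the token width, whereas concatenation does not; this is one practical reason a design might prefer one strategy over another.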