Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation
Kai Brandenbusch
TL;DR
This work presents a text-and-style conditional latent diffusion framework for handwritten text generation (HTG) that can synthesize word images in the handwriting style of unseen writers. It leverages a masked autoencoder (MAE) to produce writer-style embeddings and extends the content encoder to fuse textual and stylistic conditioning, enabling multiple conditioning strategies within a latent diffusion model (LDM). A semi-supervised training scheme is proposed to adapt the model to new, unlabeled datasets, improving generation quality when transcriptions are unavailable. Evaluations on IAM and RIMES show that classifier-free guidance and style-conditioning via the TA/TS approaches enhance content fidelity and style realism, with semi-supervised training yielding notable gains in unseen-domain generation, though gaps remain relative to fully supervised baselines. Overall, the method advances HTG by enabling unseen-writer synthesis and practical adaptation to new datasets for generating training data for handwriting recognition models.
Abstract
Generating realistic-looking, readable images of handwritten text is a challenging task referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in the calligraphic style of the desired writer. An important application of HTG is the generation of training images to adapt downstream models to new data sets. Following their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG that enables the generation of writing styles not seen during training by learning style conditioning with a masked autoencoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore its influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM database and use the RIMES database to examine the generation of data not seen during training, achieving improvements in this particularly promising application of DMs for HTG.
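
For readers unfamiliar with classifier-free guidance, the following minimal sketch shows the combination step in the form it commonly takes: the noise prediction made with conditioning is extrapolated away from the prediction made without it. This is an illustration of the general technique, not the implementation used in this work. The function name, the use of flat Python lists in place of image tensors, and the convention that a scale of 1.0 reproduces the purely conditional prediction are all assumptions made here.

```python
from typing import List


def classifier_free_guidance(
    eps_cond: List[float],
    eps_uncond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Hypothetical helper for illustration only. Assumed convention:
      guidance_scale = 0.0 -> unconditional prediction
      guidance_scale = 1.0 -> conditional prediction
      guidance_scale > 1.0 -> extrapolation past the conditional prediction
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # Element-wise: eps_uncond + scale * (eps_cond - eps_uncond)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # toy stand-in for the text-and-style conditioned prediction
    uncond = [0.0, 1.0]  # toy stand-in for the prediction with conditioning dropped
    print(classifier_free_guidance(cond, uncond, 3.0))
```

In a diffusion sampler, this combination would be applied at every denoising step, with the unconditional branch obtained by dropping the conditioning input.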
