DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

Konstantina Nikolaidou; George Retsinas; Giorgos Sfikas; Marcus Liwicki

TL;DR

DiffusionPen introduces a latent diffusion framework for handwritten text generation conditioned on text and a small set of style exemplars ($k=5$). It employs a hybrid style encoder that combines metric learning and classification to form a meaningful, continuous writer-style space, enabling generation of both seen and unseen styles for in-vocabulary (IV) and out-of-vocabulary (OOV) words. In extensive experiments on the IAM dataset, DiffusionPen outperforms state-of-the-art methods in sample quality and diversity, and its synthetic data significantly improves Handwriting Text Recognition performance, approaching the results obtained with real data. The work also demonstrates style interpolation and multi-style mixing as effective mechanisms for introducing controllable variation, and it acknowledges ethical considerations and practical limitations.

Abstract

Handwritten Text Generation (HTG) conditioned on text and style is a challenging task due to the variability of inter-user characteristics and the unlimited combinations of characters that form new words unseen during training. Diffusion Models have recently shown promising results in HTG but remain under-explored. We present DiffusionPen (DiffPen), a 5-shot style handwritten text generation approach based on Latent Diffusion Models. By utilizing a hybrid style extractor that combines metric learning and classification, our approach manages to capture both textual and stylistic characteristics of seen and unseen words and styles, generating realistic handwritten samples. Moreover, we explore several variation strategies of the data with multi-style mixtures and noisy embeddings, enhancing the robustness and diversity of the generated data. Extensive experiments using the IAM offline handwriting database show that our method outperforms existing methods qualitatively and quantitatively, and that its additional generated data can improve the performance of Handwriting Text Recognition (HTR) systems. The code is available at: https://github.com/koninik/DiffusionPen.
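
The variation strategies mentioned above (style interpolation, multi-style mixtures, and noisy embeddings) can be illustrated with a minimal sketch on toy vectors. This is not the authors' implementation: the mean aggregation of the five style samples, the linear form of the interpolation, the Gaussian noise model, and all function names are assumptions made for illustration only.

```python
import random

def mean_embedding(embeddings):
    """Average equal-length vectors (assumed way to pool k style samples)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def interpolate(style_a, style_b, weight):
    """Linear interpolation: weight=0 returns style_a, weight=1 returns style_b."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(style_a, style_b)]

def mix_styles(styles, weights):
    """Convex combination of several style vectors; weights are normalised to sum to 1."""
    total = sum(weights)
    dim = len(styles[0])
    return [sum(w / total * s[i] for w, s in zip(weights, styles)) for i in range(dim)]

def add_noise(style, sigma, rng):
    """Perturb a style vector with isotropic Gaussian noise of std sigma."""
    return [x + rng.gauss(0.0, sigma) for x in style]

# Toy 3-dimensional embeddings for two hypothetical writers, k=5 samples each.
rng = random.Random(0)
writer_a = [[1.0, 0.0, 0.0] for _ in range(5)]
writer_b = [[0.0, 1.0, 0.0] for _ in range(5)]

style_a = mean_embedding(writer_a)
style_b = mean_embedding(writer_b)
halfway = interpolate(style_a, style_b, 0.5)         # equal blend of both writers
mixed = mix_styles([style_a, style_b], [3.0, 1.0])   # 75% writer A, 25% writer B
noisy = add_noise(style_a, 0.1, rng)                 # jittered version of writer A
```

In a full pipeline, the resulting vector would take the place of a single writer's style condition when sampling from the generator; how the real model consumes it is not specified in this summary.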

Paper Structure (14 sections, 18 figures, 9 tables)

Introduction
Related Work
Proposed Method
Experiments
Datasets, Training Setup, and Considered SotA Approaches
Quality Assessment
Handwriting Text Recognition
Style Variation
Limitations and Ethical Considerations
Conclusion
Style Encoder
Qualitative and Quantitative Results
Style Variation
Handwriting Text Recognition

Figures (18)

Figure 1: Qualitative results generated using our method for four cases: In-Vocabulary words and Seen style (IV-S), In-Vocabulary words and Unseen style (IV-U), Out-of-Vocabulary words and Seen style (OOV-S), and Out-of-Vocabulary words and Unseen style (OOV-U), as well as digits and punctuation.
Figure 2: Overview of DiffusionPen. DiffusionPen comprises the conditional generator UNet Encoder-Decoder, having a Text Encoder $T_E$, a Style Encoder $S_E$, and a VAE encoder $\text{VAE}_E$ during training (top panel) and a VAE decoder $\text{VAE}_D$ during sampling (bottom panel).
Figure 3: A graphical representation of the hybrid style encoder $S_E$ training. The style encoder creates the feature representations of the anchor sample $f_{s_w}$, positive sample $f_{s_+}$, and negative sample $f_{s_-}$. The metric learning training part pushes the positive features closer to the anchor and the negative features further away. The model uses the class prediction $y_w$ of the anchor for the classification optimization.
Figure 4: Visual comparison of images generated by the considered approaches and our proposed method (DiffusionPen).
Figure 5: (a) Exemplar generated samples from Unseen Styles. On the left, we can see the 5 style samples used for the style condition, and on the right, the generated word. (b) Generated words of different styles comprised of more than 10 characters.
...and 13 more figures
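
The hybrid style-encoder objective described in the Figure 3 caption (a metric-learning term on anchor, positive, and negative features plus a classification term on the anchor's writer prediction) can be sketched as follows. This is a minimal illustration, not the authors' code: the Euclidean distance, the triplet margin of 1.0, the unweighted sum of the two terms, and all function names are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def hybrid_loss(anchor, positive, negative, logits, writer_id,
                margin=1.0, ce_weight=1.0):
    """Sum of the metric-learning term and the writer-classification term.
    The weighting between the two terms is an assumption."""
    metric_term = triplet_loss(anchor, positive, negative, margin)
    class_term = cross_entropy(logits, writer_id)
    return metric_term + ce_weight * class_term

# Toy features: the positive is close to the anchor, the negative is far away.
f_anchor = [0.0, 0.0]
f_pos = [0.3, 0.4]      # distance 0.5 from the anchor
f_neg = [3.0, 4.0]      # distance 5.0 from the anchor
anchor_logits = [0.0, 0.0]  # uninformative prediction over two writers

easy = hybrid_loss(f_anchor, f_pos, f_neg, anchor_logits, writer_id=0)
hard = hybrid_loss(f_anchor, f_neg, f_pos, anchor_logits, writer_id=0)
```

With the well-separated triplet, the metric term vanishes and only the classification term remains; swapping the positive and negative samples produces a large metric penalty, which is the signal that pulls same-writer features together and pushes different-writer features apart.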
