VATr++: Choose Your Words Wisely for Handwritten Text Generation

Bram Vanherle, Vittorio Pippi, Silvia Cascianelli, Nick Michiels, Frank Van Reeth, Rita Cucchiara

TL;DR

The paper tackles standardized evaluation and rare-character generation in Styled Handwritten Text Generation (HTG). It introduces VATr++, an extension of VATr with input preparation and regularization, leveraging Visual Archetypes and a Transformer-based architecture. A large-scale synthetic pretraining of the style encoder improves generalization to unseen styles. Evaluation on the IAM dataset with a fixed protocol shows VATr++ outperforms baselines on FID, KID, and CER, especially for long-tail characters, and demonstrates cross-dataset generalization.

Abstract

Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This study delves deeper into a cutting-edge Styled-HTG approach, proposing strategies for input preparation and training regularization that allow the model to achieve better performance and generalize better. These aspects are validated through extensive analysis on several different settings and datasets. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research - the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.

Paper Structure (27 sections, 6 equations, 25 figures, 11 tables)

This paper contains 27 sections, 6 equations, 25 figures, 11 tables.

Figures (25)

  • Figure 1: We extend our previous state-of-the-art Styled Handwritten Text Generation model, VATr (Pippi et al., 2023), by training it on a modified dataset and using smart augmentations of the training signals. This enables the network to generate rare characters more faithfully and generalize to new styles.
  • Figure 2: Overview of our extended Visual Archetypes-based Transformer for HTG (VATr++). Style samples are fed to the synthetically pre-trained style encoder, which produces a style vector for the author. The style vector is passed through the decoder along with a linear projection of the visual archetype representation of the desired text. During training, that text is augmented to increase variation. The decoder uses the style vector and text representation to generate an image. The training process is guided by a discriminator, an HTR network, and a style classification network. These networks are jointly optimized during training, utilizing real images from the IAM dataset. The input to these auxiliary networks is also augmented to prevent overfitting.
  • Figure 3: Punctuation marks are treated as words in the IAM dataset as used for HTG. This causes ambiguity and inconsistency, since characters and punctuation marks are all scaled to the same height.
  • Figure 4: We modify the IAM dataset by attaching single punctuation marks (blue) to their closest words in the line-level IAM (red). This style input preparation strategy helps prevent ambiguity and inconsistency.
  • Figure 5: Text augmentation is used to increase the number of rare characters during training. Here, the word Orval is augmented. The letter a gets chosen to be swapped due to its high occurrence in the training corpus and is replaced by the character 5, which is rare in the dataset.
  • ...and 20 more figures
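
The text augmentation described in the Figure 5 caption (swapping a frequent character for a rare one) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `rare_quantile` threshold, and the rule of always replacing the single most frequent character are assumptions made for illustration, and the paper may select characters differently (for example, by sampling).

```python
import random
from collections import Counter


def char_frequencies(corpus):
    """Count how often each character occurs across the training words."""
    return Counter(ch for word in corpus for ch in word)


def augment_word(word, freqs, rng, rare_quantile=0.25):
    """Swap the most frequent character of `word` for a randomly chosen rare one.

    `rare_quantile` is a hypothetical knob: the rarest fraction of the
    character set forms the pool of replacement candidates.
    """
    if not word or not freqs:
        return word
    # Characters ordered from rarest to most common (ties broken by code point).
    ranked = sorted(freqs, key=lambda ch: (freqs[ch], ch))
    n_rare = max(1, int(len(ranked) * rare_quantile))
    rare_pool = ranked[:n_rare]
    # Position of the character that is most common in the corpus.
    idx = max(range(len(word)), key=lambda i: freqs.get(word[i], 0))
    replacement = rng.choice(rare_pool)
    return word[:idx] + replacement + word[idx + 1:]


if __name__ == "__main__":
    # Toy corpus, invented for the example.
    corpus = ["banana", "aardvark", "Orval", "5"]
    freqs = char_frequencies(corpus)
    print(augment_word("Orval", freqs, random.Random(0)))
```

In this toy corpus the letter a is the most frequent character, so it is the one replaced in Orval, mirroring the example in the caption. A real training pipeline would more likely apply the swap with some probability per word rather than to every word.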