Rethinking HTG Evaluation: Bridging Generation and Recognition
Konstantina Nikolaidou, George Retsinas, Giorgos Sfikas, Marcus Liwicki
TL;DR
This paper tackles the misalignment between Handwriting Generation (HTG) evaluation and downstream utility by introducing three task-driven metrics, $ \text{HTG}_{\text{HTR}} $, $ \text{HTG}_{\text{style}} $, and $ \text{HTG}_{\text{OOV}} $, and a standardized evaluation protocol that jointly assesses style fidelity, content accuracy, and data diversity using Handwriting Text Recognition (HTR) and Writer Identification (WI) models on the IAM dataset. The authors show that conventional metrics such as FID fail to reflect the practical utility of generated handwriting, whereas the new metrics provide richer, task-relevant signals for improving HTR performance, as measured by the character error rate (CER) on real test data. Experiments across four HTG systems indicate that data variability and effective style transfer correlate with downstream recognition gains, and that filtering synthetic data can improve HTR performance for certain styles. The work advocates standardized benchmarking for HTG and provides public code to enable reproducible evaluation.
Abstract
The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation (HTG), even if they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $ \text{HTG}_{\text{HTR}} $, $ \text{HTG}_{\text{style}} $, and $ \text{HTG}_{\text{OOV}} $, and argue that they are better suited to evaluating the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition (HTR) and Writer Identification (WI) models and emphasize writing style, textual content, and diversity as the main aspects that characterize handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: https://github.com/koninik/HTG_evaluation.
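
Because the proposed metrics build on the recognition error of HTR models, it may help to see how a character error rate (CER) is commonly computed. The following is a minimal sketch, not the authors' implementation: it assumes a corpus-level CER (total Levenshtein edits divided by total reference characters), and the function names `edit_distance` and `cer` are hypothetical. The protocol in the linked repository may aggregate or normalize differently.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits over total reference characters.

    Assumption: micro-averaging over the whole set, which is one common
    convention; per-sample averaging would give different values.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

In a setting like the one described above, `references` would be the ground-truth transcriptions of the real test images and `hypotheses` the outputs of an HTR model, for instance one trained on synthetic data; a lower value indicates better recognition.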
