SAGHOG: Self-Supervised Autoencoder for Generating HOG Features for Writer Retrieval
Marco Peer, Florian Kleber, Robert Sablatnig
TL;DR
SAGHOG tackles writer retrieval by pretraining a Vision Transformer to reconstruct HOG representations of binarized handwriting, using Segment Anything (SAM) to isolate handwriting from historical documents. The two-stage approach combines self-supervised HOG reconstruction with a NetRVLAD encoding layer for final retrieval and is evaluated on Historical-WI, HisFrag20, and GRK-Papyri. It shows notable gains on challenging data such as HisFrag20 (mAP up to 57.2% unsupervised) and reaches a Top-1 accuracy of 58.0% on GRK-Papyri. Ablation studies demonstrate the importance of SAM preprocessing, HOG-bin targets, and encoder dimensionality, and show that finetuning SAGHOG with NetRVLAD is particularly effective on complex datasets. The work highlights the viability of ViT-based, self-supervised handwriting representations for writer retrieval, which perform strongly even under domain shifts and with limited data, and it points to future improvements through greater data diversity and domain-specific style embeddings.
Abstract
This paper introduces SAGHOG, a self-supervised pretraining strategy for writer retrieval using HOG features of the binarized input image. Our preprocessing applies the Segment Anything technique to extract handwriting from various datasets, yielding about 24k documents; we then train a vision transformer to reconstruct masked patches of the handwriting. SAGHOG is then finetuned by appending NetRVLAD as an encoding layer to the pretrained encoder. Evaluation of our approach on three historical datasets, Historical-WI, HisFrag20, and GRK-Papyri, demonstrates the effectiveness of SAGHOG for writer retrieval. Additionally, we provide ablation studies on our architecture and evaluate both unsupervised and supervised finetuning. Notably, on HisFrag20, SAGHOG outperforms related work with a mAP of 57.2%, a margin of 11.6% over the current state of the art, showcasing its robustness on challenging data. It is also competitive even on small datasets, e.g. GRK-Papyri, where we achieve a Top-1 accuracy of 58.0%.
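
To make the pretraining objective concrete, the following is a minimal sketch of how per-patch HOG targets and a masked reconstruction loss could be computed. It is not the authors' implementation: the function names, the 16-pixel patch size, the 9 orientation bins, the 0.75 mask ratio, and the simplified HOG (one L2-normalized orientation histogram per patch, without cell and block structure) are all assumptions chosen for illustration, and a zero array stands in for the output of the vision transformer.

```python
import numpy as np

def hog_patch_targets(binary_img, patch=16, bins=9):
    """Simplified HOG: one L2-normalized orientation histogram per patch.

    patch and bins are illustrative assumptions, not values from the paper.
    """
    img = binary_img.astype(np.float64)
    gy, gx = np.gradient(img)                 # gradients along rows, columns
    mag = np.hypot(gx, gy)                    # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    ph, pw = img.shape[0] // patch, img.shape[1] // patch
    targets = np.zeros((ph * pw, bins))
    for i in range(ph):
        for j in range(pw):
            rows = slice(i * patch, (i + 1) * patch)
            cols = slice(j * patch, (j + 1) * patch)
            # Magnitude-weighted vote of every pixel into its orientation bin.
            hist = np.bincount(bin_idx[rows, cols].ravel(),
                               weights=mag[rows, cols].ravel(),
                               minlength=bins)
            norm = np.linalg.norm(hist)
            targets[i * pw + j] = hist / norm if norm > 0 else hist
    return targets

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask; True marks patches that are hidden during pretraining."""
    n_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_masked, replace=False)] = True
    return mask

def masked_hog_loss(pred, targets, mask):
    """Mean squared error between predicted and target HOG, masked patches only."""
    return float(np.mean((pred[mask] - targets[mask]) ** 2))

rng = np.random.default_rng(0)
img = np.zeros((64, 64), dtype=np.uint8)
img[20:44, 30:34] = 1                          # synthetic vertical pen stroke
targets = hog_patch_targets(img)               # shape: (16 patches, 9 bins)
mask = random_patch_mask(len(targets), 0.75, rng)  # assumed mask ratio
pred = np.zeros_like(targets)                  # placeholder for the model output
loss = masked_hog_loss(pred, targets, mask)
```

In this sketch the loss is taken over masked patches only, so the model is scored on handwriting it has not seen; whether SAGHOG restricts the loss in the same way is not stated in the abstract.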
