SAGHOG: Self-Supervised Autoencoder for Generating HOG Features for Writer Retrieval

Marco Peer, Florian Kleber, Robert Sablatnig

TL;DR

SAGHOG tackles writer retrieval by pretraining a Vision Transformer to reconstruct HOG features of binarized handwriting, using Segment Anything (SAM) to isolate the handwriting in historical documents. The approach has two stages: self-supervised HOG reconstruction, followed by finetuning with a NetRVLAD encoding layer for retrieval. It is evaluated on Historical-WI, HisFrag20, and GRK-Papyri, reaching a mAP of 57.2% on the challenging HisFrag20 dataset and a Top-1 accuracy of 58.0% on GRK-Papyri. Ablation studies demonstrate the importance of SAM preprocessing, HOG-bin targets, and encoder dimensionality, and show that finetuning SAGHOG with NetRVLAD is particularly effective on complex datasets. The work supports the viability of ViT-based, self-supervised handwriting representations for writer retrieval, including under domain shift and with limited data, and points to data diversity and domain-specific style embeddings as directions for future work.

Abstract

This paper introduces SAGHOG, a self-supervised pretraining strategy for writer retrieval using HOG features of the binarized input image. Our preprocessing applies the Segment Anything technique to extract handwriting from various datasets, ending up with about 24k documents, followed by training a vision transformer to reconstruct masked patches of the handwriting. SAGHOG is then finetuned by appending NetRVLAD as an encoding layer to the pretrained encoder. Evaluation of our approach on three historical datasets, Historical-WI, HisFrag20, and GRK-Papyri, demonstrates the effectiveness of SAGHOG for writer retrieval. Additionally, we provide ablation studies on our architecture and evaluate unsupervised and supervised finetuning. Notably, on HisFrag20, SAGHOG outperforms related work with a mAP of 57.2%, a margin of 11.6% over the current state of the art, showcasing its robustness on challenging data. It is also competitive on small datasets, e.g. GRK-Papyri, where we achieve a Top-1 accuracy of 58.0%.
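The abstract reports mAP and Top-1 accuracy as retrieval metrics. As a minimal sketch of how such numbers are commonly computed, the snippet below runs a leave-one-out retrieval over document embeddings: every document serves as a query once and the remaining documents are ranked by cosine similarity. The function name `retrieval_metrics`, the use of cosine similarity, and the choice to skip queries without any relevant document are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def retrieval_metrics(embeddings, writer_ids):
    """Leave-one-out retrieval; returns (mAP, top1) as fractions in [0, 1].

    Assumes non-zero embeddings. Queries whose writer has no other
    document in the set are skipped (an assumption of this sketch).
    """
    x = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(writer_ids)
    # L2-normalise so that the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a query must never retrieve itself

    average_precisions, top1_hits = [], []
    for q in range(len(y)):
        # Rank all documents by descending similarity; the query itself
        # has similarity -inf, ends up last, and is dropped.
        order = np.argsort(-sim[q], kind="stable")[:-1]
        relevant = y[order] == y[q]
        if not relevant.any():
            continue
        top1_hits.append(bool(relevant[0]))
        # Precision at each rank where a relevant document appears.
        ranks = np.flatnonzero(relevant) + 1
        precisions = np.arange(1, len(ranks) + 1) / ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(average_precisions)), float(np.mean(top1_hits))
```

For example, with four toy embeddings in which each writer's two documents point in nearly the same direction, every query retrieves its partner first, so both metrics are perfect.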

Paper Structure (33 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Overview of SAGHOG. The decoder reconstructs the HOG features of the masked tokens by only using the non-masked tokens. In the second stage, the pretrained encoder with NetRVLAD is finetuned, either by training on writer or pseudo labels.
  • Figure 2: Preprocessing scheme. (a) We obtain the handwriting of a page by processing the segmentation of SAM. (b) Examples of the produced images. (c) $32\times 32$ patches are extracted by applying SIFT on the binarized image.
  • Figure 3: Examples of reconstructed HOG features. For better visualization, we stack multiple patches to create images of size (128, 512). The images are not seen during pretraining. (a) Raw input image. (b) Masked input with a ratio of 0.75. (c) Prediction of the missing HOG features of the binarized image. (d) Actual HOG features.
  • Figure 4: Examples of the datasets used. (a) Binarized example of the Historical-WI test set. (b) HisFrag20: sample of the training and test set. (c) GRK-Papyri (Dioscorus-5), color and binarized version [christlein_papyri]; we only use the binarized set.
  • Figure 5: Qualitative results of the retrieval on HisFrag20. Left: Query. We show the four nearest documents. If the retrieved document is written by the same author as the query, we highlight the image in green, otherwise red.
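
The pretraining objective described in the captions of Figures 1 and 3 (mask tokens at a ratio of 0.75, then predict the HOG features of the masked tokens) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the HOG here is a plain per-cell orientation histogram without block normalization, and the cell size of 8, the 9 orientation bins, the mean-squared-error loss, and all function names are assumptions.

```python
import numpy as np

def hog_cells(binary_img, cell=8, bins=9):
    """Simplified HOG: per-cell histogram of unsigned gradient
    orientations, weighted by gradient magnitude, L2-normalised per cell."""
    img = np.asarray(binary_img, dtype=np.float64)
    gy, gx = np.gradient(img)                # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    return hist / np.maximum(norm, 1e-12)    # empty cells stay all-zero

def random_mask(num_tokens, ratio=0.75, rng=None):
    """Boolean mask with round(ratio * num_tokens) entries set to True."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_tokens * ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_hog_loss(pred, target, mask):
    """Mean squared error over masked tokens only (assumed loss)."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))
```

With these assumed settings, a $32\times 32$ binarized patch yields a $4\times 4$ grid of cells, i.e. 16 target vectors of length 9, of which 12 would be masked at a ratio of 0.75. In an actual training loop, `pred` would come from the decoder and only the masked positions would contribute to the loss.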