Self-Supervised Vision Transformers for Writer Retrieval
Tim Raven, Arthur Matei, Gernot A. Fink
TL;DR
This work addresses writer retrieval for historical handwritten documents using a self-supervised Vision Transformer (ViT) as the feature extractor. It deliberately uses the ViT's foreground patch tokens rather than the class token, aggregates them with VLAD encoding, and compares pages by cosine similarity with optional reranking, all without fine-tuning. The approach achieves state-of-the-art results on Historical-WI (83.1% mAP with reranking) and HisIR19 (95.0% mAP with reranking), and reaches strong performance on CVL without fine-tuning, indicating that the features generalize to modern handwriting. The findings show that VLAD-encoded foreground tokens learned through self-supervised learning (SSL) outperform handcrafted-feature and CNN-based methods, and that token filtering combined with efficient aggregation substantially improves writer retrieval performance, with potential for further gains from alternative SSL strategies and architectures.
Abstract
While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state of the art on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
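The aggregation and retrieval steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the patch tokens are assumed to be already extracted by a ViT, the centroids are assumed to come from a clustering step such as k-means run beforehand, and the foreground rule (an ink-fraction threshold named `min_ink`) is a hypothetical stand-in for whatever filter the method actually uses. Reranking and any further normalization are omitted.

```python
import math


def select_foreground(tokens, ink_fractions, min_ink=0.05):
    """Keep tokens whose image patch contains enough ink.

    ASSUMPTION: the ink-fraction criterion and its threshold are
    hypothetical; the source only states that foreground tokens are used.
    """
    return [t for t, ink in zip(tokens, ink_fractions) if ink >= min_ink]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors into a single VLAD vector.

    Each descriptor is hard-assigned to its nearest centroid, the residuals
    (descriptor minus centroid) are summed per centroid, and the
    concatenated sums are L2-normalized.
    """
    dim = len(centroids[0])
    residual_sums = [[0.0] * dim for _ in centroids]
    for desc in descriptors:
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(desc, centroids[k])),
        )
        for j in range(dim):
            residual_sums[nearest][j] += desc[j] - centroids[nearest][j]
    flat = [x for row in residual_sums for x in row]
    return l2_normalize(flat)


def cosine_similarity(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def rank_pages(query_vec, gallery_vecs):
    """Return gallery indices sorted from most to least similar to the query."""
    return sorted(
        range(len(gallery_vecs)),
        key=lambda i: cosine_similarity(query_vec, gallery_vecs[i]),
        reverse=True,
    )
```

In use, each page would be reduced to one vector via `vlad_encode(select_foreground(tokens, ink_fractions), centroids)`, and `rank_pages` would then order the remaining pages for a given query page. Hard assignment is chosen here only for brevity.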
