Self-Supervised Vision Transformers for Writer Retrieval

Tim Raven, Arthur Matei, Gernot A. Fink

TL;DR

This work addresses writer retrieval for historical handwritten documents using a self-supervised Vision Transformer (ViT) feature extractor. It uses only the foreground patch tokens from the ViT output and aggregates them with VLAD, avoiding fine-tuning and relying on cosine similarity with optional reranking. The approach achieves state-of-the-art results on Historical-WI ($\text{mAP}=83.1\%$ with reranking) and HisIR19 ($\text{mAP}=95.0\%$ with reranking), and strong performance on CVL without fine-tuning, demonstrating robust generalization to modern handwriting. The findings show that VLAD encodings of foreground tokens learned through self-supervised learning (SSL) surpass traditional methods based on CNN or handcrafted features, and that selective token filtering plus efficient aggregation substantially boosts writer retrieval performance, with potential for further gains via alternative SSL strategies and architectures.

Abstract

While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
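
As a rough illustration of how an mAP figure like those above can be computed, the following sketch ranks documents by cosine similarity and averages the per-query average precision. It assumes a leave-one-out protocol in which every document serves once as the query against all others; the function name `mean_average_precision` and its arguments are hypothetical and not taken from the paper, and reranking is not included.

```python
import numpy as np

def mean_average_precision(encodings, writer_ids):
    """Leave-one-out retrieval mAP using cosine similarity (illustrative sketch).

    encodings:  (N, D) array, one global descriptor per document
    writer_ids: length-N sequence of writer labels
    """
    encodings = np.asarray(encodings, dtype=float)
    writer_ids = np.asarray(writer_ids)
    # Cosine similarity is the dot product of L2-normalized vectors.
    unit = encodings / np.linalg.norm(encodings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # the query must not retrieve itself

    average_precisions = []
    for q in range(len(unit)):
        # Sort by descending similarity; the query (-inf) ends up last and is dropped.
        order = np.argsort(-sim[q])[:-1]
        relevant = writer_ids[order] == writer_ids[q]
        if not relevant.any():
            continue  # assumption: queries without any relevant document are skipped
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        # Precision at each rank where a relevant document appears, then averaged.
        average_precisions.append((hits[relevant] / ranks[relevant]).mean())
    return float(np.mean(average_precisions))
```

For example, four two-dimensional descriptors in which each writer's two documents point in nearly the same direction give a perfect score, because every query ranks its only relevant document first.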

Paper Structure (31 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Illustration of our proposed method. Document images are cut into windows in a regular grid. The windows are again cut into patches and form the input sequence. To extract local features, a self-supervised Vision Transformer (ViT) is used. We extract only foreground patch tokens from the ViT output sequence, i.e., only the patch tokens of input patches with sufficient handwriting. All foreground tokens of the document are aggregated using a VLAD encoding. These encodings are used for retrieval and reranking.
  • Figure 2: Visualization of sample document images from the Historical-WI dataset: (a) and (b) show color images, (c) shows a binarized image provided in the dataset.
  • Figure 3: Visualization of sample document images from the HisIR19 dataset.
  • Figure 4: Evaluation of $S_\text{eval}$, i.e., the stride with which windows are sampled from the test documents during inference on the Historical-WI dataset. We compare different combinations of features and aggregation methods. The left plot shows mAP and the right plot shows Top1 accuracy.
  • Figure 5: Evaluation of parameter $C$, i.e., the number of cluster centers used to compute the VLAD codebook $\Theta$. The left plot shows mAP and the right plot shows Top1 accuracy.
  • ...and 1 more figure
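
The aggregation step described in the Figure 1 caption can be sketched roughly as follows: keep only tokens whose input patch contains enough handwriting, then sum the residuals of those tokens to their nearest codebook center. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the ink-ratio threshold of 0.05, the function names, and the plain L2 normalization are all hypothetical choices, and the codebook is assumed to have been computed beforehand (e.g., by k-means with $C$ centers).

```python
import numpy as np

def foreground_mask(binary_patches, min_ink_ratio=0.05):
    """Mark patches that contain enough handwriting.

    binary_patches: (N, P, P) array with 1 for ink pixels and 0 for background.
    min_ink_ratio:  hypothetical threshold; the value used in the paper is not given here.
    """
    ink_ratio = binary_patches.reshape(len(binary_patches), -1).mean(axis=1)
    return ink_ratio >= min_ink_ratio

def vlad_encode(tokens, centers):
    """Aggregate local tokens into one VLAD vector.

    tokens:  (N, D) foreground patch tokens of a document
    centers: (C, D) codebook of cluster centers
    Returns an L2-normalized vector of length C * D.
    """
    # Squared Euclidean distance of every token to every center: shape (N, C).
    dist2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)

    vlad = np.zeros_like(centers, dtype=float)
    for c in range(len(centers)):
        members = tokens[nearest == c]
        if len(members):
            # Sum of residuals between assigned tokens and their center.
            vlad[c] = (members - centers[c]).sum(axis=0)

    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

In use, the mask would select rows of the ViT output, e.g. `vlad_encode(tokens[foreground_mask(patches)], centers)`, and the resulting vectors would then be compared with cosine similarity for retrieval.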