Decorrelation-based Self-Supervised Visual Representation Learning for Writer Identification

Arkadip Maitra, Shree Mitra, Siladittya Manna, Saumik Bhattacharya, Umapada Pal

TL;DR

This work addresses text-independent writer identification by learning disentangled stroke representations through a decorrelation-based self-supervised objective. It introduces a patch-based encoder with a novel loss, $L_c$, that enforces per-dimension decorrelation and independence under a joint distribution, aided by per-dimension standardization and whitening. Empirical results on IAM, CVL, and Firemaker show competitive word-level accuracy and state-of-the-art page-level accuracy, with favorable computational efficiency and strong semi-supervised fine-tuning performance. The approach is validated statistically through correlation analyses and t-tests, which indicate substantial decorrelation of patch-level features and improved robustness for handwriting verification tasks.

Abstract

Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Among non-contrastive algorithms, decorrelation-based self-supervised pretraining has shown great promise, yielding performance on par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply it to learning disentangled stroke features for writer identification. We propose a modified formulation of SWIS, a decorrelation-based framework originally proposed for signature verification, that additionally standardizes the features along each dimension. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods. To the best of our knowledge, this work is the first to apply self-supervised learning to representation learning for writer verification tasks.
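To make the idea of a decorrelation-based objective with per-dimension standardization concrete, the following is a minimal sketch in plain Python. It is not the paper's loss $L_c$, whose exact form is not given in this summary. It assumes a generic Barlow Twins-style objective: the features of two augmented views are standardized along each dimension over the batch, their cross-correlation matrix is computed, diagonal entries are pulled toward 1, and off-diagonal entries are pushed toward 0. The function names, the weight `lam`, and the toy data are all hypothetical.

```python
import math
import random


def standardize_columns(z):
    """Standardize each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        std = math.sqrt(var) + 1e-8  # epsilon guards against constant dimensions
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def cross_correlation(za, zb):
    """D x D cross-correlation matrix between two standardized batches of shape N x D."""
    n, d = len(za), len(za[0])
    return [[sum(za[k][i] * zb[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]


def decorrelation_loss(za, zb, lam=0.005):
    """Assumed Barlow Twins-style loss, not the paper's L_c.

    The diagonal term aligns the two views; the off-diagonal term
    decorrelates different feature dimensions. `lam` is an assumed weight.
    """
    c = cross_correlation(standardize_columns(za), standardize_columns(zb))
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag


# Toy data: 64 samples with 4 feature dimensions per view.
rng = random.Random(0)
view_a = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
# A second "view" that is a lightly perturbed copy of the first.
view_b = [[v + 0.05 * rng.gauss(0, 1) for v in row] for row in view_a]
# Rotating the batch breaks the pairing between the two views.
mismatched_b = view_b[1:] + view_b[:1]

loss_aligned = decorrelation_loss(view_a, view_b)
loss_mismatched = decorrelation_loss(view_a, mismatched_b)
print(loss_aligned < loss_mismatched)
```

In this toy setup the correctly paired views give a much smaller loss than the mismatched pairing, because the diagonal of the cross-correlation matrix is close to 1 only when each row of the two batches comes from the same sample.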

Paper Structure

This paper contains 26 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Sample images from the CVL dataset, showing the sparse nature of handwritten text images.
  • Figure 2: The proposed framework. A single image is augmented to form a positive pair, which is fed to the weight-shared base encoder for feature extraction. After the encoder output is reshaped, the feature vectors are passed through the projector. The final feature vector is passed to the loss function for loss calculation and subsequent optimization. $T_1$ and $T_2$ are the two different augmentations applied to the input image to obtain the positive pair.
  • Figure 3: Correlation maps (a-d) between the different patches of a single word image. The correlation is calculated from features extracted by the encoder pre-trained using the proposed framework. Panels (e-h) show the KDE plots of the correlation values. Panels (i-l) show the p-values of the left-tailed t-test conducted on the correlation values as described in Sec. \ref{sec:statanal}.
  • Figure 4: Cumulative distribution (a) and quantile-quantile (Q-Q) plot (b) for the samples presented in Fig. 3(a). The plots show that the correlation values satisfy the normality assumption, which justifies applying the t-test.
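
The analysis summarized in the captions of Figures 3 and 4 can be illustrated with a rough sketch. The exact test setup is not given in this summary, so the sketch assumes a one-sample left-tailed t-test: Pearson correlations are computed between the feature vectors of every pair of patches, and the mean correlation is tested against a hypothetical threshold `mu0` (null hypothesis: mean >= `mu0`; alternative: mean < `mu0`). The function names, the threshold, and the random stand-in features are assumptions for illustration only.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_map(patch_features):
    """P x P matrix of correlations between the feature vectors of P patches."""
    return [[pearson(a, b) for b in patch_features] for a in patch_features]


def pairwise_values(matrix):
    """Upper-triangle entries, so each distinct patch pair is counted once."""
    p = len(matrix)
    return [matrix[i][j] for i in range(p) for j in range(i + 1, p)]


def left_tailed_t_statistic(values, mu0):
    """One-sample t statistic and degrees of freedom for H1: mean < mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1


# Stand-in features: 6 patches, each with a 32-dimensional feature vector.
rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(6)]

cmap = correlation_map(patches)
values = pairwise_values(cmap)
t_stat, dof = left_tailed_t_statistic(values, mu0=0.5)  # mu0 is a hypothetical threshold

# The left-tailed p-value is the Student t CDF evaluated at t_stat with dof
# degrees of freedom, e.g. scipy.stats.t.cdf(t_stat, dof) if SciPy is installed.
print(len(values), dof, t_stat < 0)
```

A strongly negative t statistic corresponds to a small left-tailed p-value, which under these assumptions would indicate that the mean patch-to-patch correlation lies below the chosen threshold. The t-test presumes approximately normal data, which is what the cumulative distribution and Q-Q plots in Figure 4 are meant to check.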