Contrastive Learning for Character Detection in Ancient Greek Papyri

Vedasri Nakka, Andreas Fischer, Rolf Ingold, Lars Vogtlin

TL;DR

The authors believe SimCLR's cropping strategies may cause a semantic shift in the input image, reducing training effectiveness despite the large pretraining dataset.

Abstract

This thesis investigates the effectiveness of SimCLR, a contrastive learning technique, in Greek letter recognition, focusing on the impact of various augmentation techniques. We pretrain the SimCLR backbone using the Alpub dataset (pretraining dataset) and fine-tune it on a smaller ICDAR dataset (finetuning dataset) to compare SimCLR's performance against traditional baseline models, which use cross-entropy and triplet loss functions. Additionally, we explore the role of different data augmentation strategies, essential for the SimCLR training process. Methodologically, we examine three primary approaches: (1) a baseline model using cross-entropy loss, (2) a triplet embedding model with a classification layer, and (3) a SimCLR pretrained model with a classification layer. Initially, we train the baseline, triplet, and SimCLR models using 93 augmentations on ResNet-18 and ResNet-50 networks with the ICDAR dataset. From these, the top four augmentations are selected using a statistical t-test. Pretraining of SimCLR is conducted on the Alpub dataset, followed by fine-tuning on the ICDAR dataset. The triplet loss model undergoes a similar process, being pretrained on the top four augmentations before fine-tuning on ICDAR. Our experiments show that SimCLR does not outperform the baselines in letter recognition tasks. The baseline model with cross-entropy loss demonstrates better performance than both SimCLR and the triplet loss model. This study provides a detailed evaluation of contrastive learning for letter recognition, highlighting SimCLR's limitations while emphasizing the strengths of traditional supervised learning models in this task. We believe SimCLR's cropping strategies may cause a semantic shift in the input image, reducing training effectiveness despite the large pretraining dataset. Our code is available at https://github.com/DIVA-DIA/MT_augmentation_and_contrastive_learning/.
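
To make the contrastive objective concrete, here is a minimal sketch of the NT-Xent loss that SimCLR uses, written in plain Python. It is not taken from the authors' repository. The function names, the temperature of 0.5, and the toy 2-D embeddings are assumptions chosen for illustration. The loss treats the two augmented views of one image as a positive pair whether or not both crops still show the same letter, which is the situation the cropping hypothesis above is concerned with.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss over N positive pairs.

    views_a[i] and views_b[i] are embeddings of two augmented views of
    image i. Every other embedding in the batch acts as a negative.
    The temperature value is an illustrative assumption.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        logits = {
            k: cosine(z[i], z[k]) / temperature
            for k in range(2 * n)
            if k != i  # an embedding is never compared with itself
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)


# Toy embeddings (hypothetical): two images, two views each.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Views whose content no longer matches their partner, as a crop that
# changes the visible content might produce.
shifted = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup `aligned` is smaller than `shifted`: the loss is low when both views of an image embed close together and high when a view looks more like another image than like its own partner.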

Paper Structure (81 sections, 3 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: Reference images. Sample full images from the ICDAR dataset containing Greek letters. We use ground-truth annotations to crop each letter subimage from the full images. We train models on the cropped letter images and finally evaluate the performance on unseen test data.
  • Figure 3: Visualizations of the pixel-level augmentations applied in our experiments. The first block shows the original resized image at a $256 \times 256$ resolution, and the following blocks show the resulting visualizations from different pixel-level augmentation strategies.
  • Figure 4: Baseline model pipeline. Spatial (e.g., randomcrop224) and pixel-level augmentations (e.g., colorjitter) are first performed on the original raw image to obtain the final pre-processed image. The augmented image is then passed through the model, which extracts a feature vector at the end of the backbone. The final feature vector is sent to the classification layer to produce output probabilities.
  • Figure 5: Batch of images used in Triplet model training. The first row represents the anchors, the second row represents the positives, which contain images with the same label as the anchors, and the third row shows the negatives, which are images with different labels from the anchors.
  • Figure 6: Triplet Model Pipeline. The top block, highlighted with a green background, represents the pretraining stage, where the model is trained on the pretraining dataset using triplet loss to learn embeddings from triplet pairs. In the next stage, a classification layer is added on top of the embedding layer, and the model is trained end-to-end with cross-entropy (CE) loss.
  • ...and 8 more figures
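
The triplet batches described in the Figure 5 and Figure 6 captions (anchors, positives with the same label, negatives with a different label) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The helper names `sample_triplets` and `triplet_margin_loss`, the Euclidean distance, the margin of 1.0, and the toy labels are all hypothetical.

```python
import math
import random


def sample_triplets(labels, rng):
    """Build one (anchor, positive, negative) index triple per usable anchor.

    The positive shares the anchor's label and the negative has a different
    label. Anchors whose class has no other member are skipped.
    """
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)

    triplets = []
    for anchor, label in enumerate(labels):
        positives = [i for i in by_label[label] if i != anchor]
        negatives = [i for i, other in enumerate(labels) if other != label]
        if positives and negatives:
            triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that asks the negative to be at least `margin` farther away
    from the anchor than the positive (Euclidean distance, assumed here)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical letter labels for a tiny batch.
toy_labels = ["alpha", "alpha", "beta", "beta", "gamma"]
toy_triplets = sample_triplets(toy_labels, random.Random(0))

# Well-separated embedding: the loss is zero.
easy = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Positive farther than negative: the loss is positive.
hard = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The single "gamma" sample produces no triplet because it has no same-label partner, so `toy_triplets` holds four triples. In the pipeline described by the Figure 6 caption, a loss of this kind would drive the pretraining stage before the classification layer is added and trained with cross-entropy.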