Improved Ear Verification with Vision Transformers and Overlapping Patches

Deeksha Arun, Kagan Ozturk, Kevin W. Bowyer, Patrick Flynn

TL;DR

This work investigates improving ear verification by applying Vision Transformers with overlapping patch strategies. Trained on a large, diverse UERC2023-derived dataset and evaluated on four real-world datasets (OPIB, AWE, WPUT, EarVN1.0), the study systematically compares four ViT configurations (ViT-T, ViT-S, ViT-B, ViT-L) across patch sizes and strides. Key findings show that overlapping patches substantially boost verification performance: ViT-T with a 28x28 patch and a stride of 14 delivers top results on several datasets, and half-patch strides outperform non-overlapping configurations in the majority of cases. The results establish a practical, lightweight transformer-based framework for ear verification and offer guidelines for patch-stride design in biometric recognition tasks.

Abstract

Ear recognition has emerged as a promising biometric modality due to the relative stability in appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their efficiency in ear recognition has been hampered by a lack of attention to overlapping patches, which is crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. Results demonstrate the critical importance of overlapping patches, yielding superior performance in 44 of 48 experiments in a structured study. Moreover, the improvement of the overlapping over the non-overlapping configurations is significant, reaching up to 10% for the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved in a configuration with a patch size of 28x28 and a stride of 14 pixels. In this configuration, the patch side is 25% of the side of the normalized image (112x112 pixels), and the stride is 12.5% of the row or column size. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition tasks in verification scenarios.

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A schematic representation of the ear recognition model trained using Vision Transformers (ViTs)
  • Figure 2: Comparison of overlapping and non-overlapping patches. In this example, we consider two cases: p56_s56 (left, non-overlapping patches) and p56_s28 (right, overlapping patches), where p56_s56 denotes a patch size of 56 and a stride of 56, and p56_s28 denotes a patch size of 56 and a stride of 28. These configurations result in 4 and 9 patches respectively, which can be calculated using Eq. \ref{NPatch}. The right part shows how the stride setting captures inter-patch information, which helps improve ear verification performance.
  • Figure 3: Overview of the Ear Transformer. The Ear Transformer splits images into overlapping patches, which are processed as tokens by the transformer encoder. This modification to ViT (Dosovitskiy et al., 2020) improves inter-patch relationships and performance.
  • Figure 4: Percentage variation, $PV(A,B)$, as defined by Eq. \ref{eq:PV}, for all datasets and all patch-stride pairs across all ViT configurations (T, S, B and L). The setting $(A,B)$ refers to: $A$: stride size is half of the patch size; $B$: stride size is equal to the patch size.
  • Figure 5: Comparison of Area Under the Curve (AUC) scores for different patch and stride configurations across all models for the EarVN1.0 dataset.
  • ...and 1 more figures
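
The patch counts quoted in the Figure 2 caption can be reproduced with a short sketch. It assumes the standard sliding-window count on a square image with no padding, that is, floor((H - P) / S) + 1 positions per side. The paper's own equation (NPatch) is not reproduced on this page, so this formula and the function names below are assumptions for illustration; the 112x112 image size is taken from the abstract.

```python
def patches_per_side(image_size: int, patch_size: int, stride: int) -> int:
    """Number of patch positions along one side of a square image.

    Assumes no padding: a patch is counted only if it fits fully
    inside the image (standard sliding-window count).
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    if patch_size <= 0 or patch_size > image_size:
        raise ValueError("patch_size must be in 1..image_size")
    return (image_size - patch_size) // stride + 1


def num_patches(image_size: int, patch_size: int, stride: int) -> int:
    """Total number of patches (tokens) for a square image and square patches."""
    return patches_per_side(image_size, patch_size, stride) ** 2


if __name__ == "__main__":
    image_size = 112  # normalized image side, from the abstract

    # (patch size, stride) pairs named in the text: p56_s56, p56_s28, p28_s14
    for patch, stride in [(56, 56), (56, 28), (28, 14)]:
        n = num_patches(image_size, patch, stride)
        print(f"p{patch}_s{stride}: {n} patches, "
              f"patch side = {patch / image_size:.1%} of image side, "
              f"stride = {stride / image_size:.1%} of image side")
```

Under these assumptions, p56_s56 yields 4 patches and p56_s28 yields 9, matching the Figure 2 caption, while p28_s14 yields 49. Halving the stride roughly doubles the patch positions per side, so the token sequence the encoder processes grows accordingly.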