Improved Ear Verification with Vision Transformers and Overlapping Patches
Deeksha Arun, Kagan Ozturk, Kevin W. Bowyer, Patrick Flynn
TL;DR
This work investigates whether Vision Transformers with overlapping patch strategies improve ear verification. Trained on a large, diverse UERC2023-derived dataset and evaluated on four real-world datasets (OPIB, AWE, WPUT, EarVN1.0), the study systematically compares four ViT configurations (ViT-T, ViT-S, ViT-B, ViT-L) across patch sizes and strides. Key findings show that overlapping patches substantially boost verification performance: ViT-T with a 28x28 patch and a stride of 14 pixels delivers top results on several datasets, and half-patch strides outperform non-overlapping configurations in the majority of cases. The results establish a practical, lightweight transformer-based framework for ear verification and offer guidelines for patch-stride design in biometric recognition tasks.
Abstract
Ear recognition has emerged as a promising biometric modality due to the relative stability of ear appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their effectiveness in ear recognition has been hampered by a lack of attention to overlapping patches, which are crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. Results demonstrate the critical importance of overlapping patches, which yield superior performance in 44 of 48 experiments in a structured study. Moreover, compared with the non-overlapping configurations, the overlapping patches give a significant increase, reaching up to 10% for the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved in a configuration with a patch size of 28x28 and a stride of 14 pixels. In this configuration, the patch side is 25% of the side of the normalized image (112x112 pixels) and the stride is 12.5% of the row or column size. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition tasks in verification scenarios.
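To make the patch-stride arithmetic concrete, the following minimal sketch counts how many patches a 112x112 image produces with and without overlap. It assumes a square image, no padding, and a plain sliding-window layout; the function names `patches_per_side` and `patch_origins` are illustrative only and are not taken from the study, whose actual tokenization may differ.

```python
def patches_per_side(image_size: int, patch_size: int, stride: int) -> int:
    """Number of patch positions along one axis (assumes no padding)."""
    if stride <= 0 or patch_size <= 0 or patch_size > image_size:
        raise ValueError("need 0 < patch_size <= image_size and stride > 0")
    return (image_size - patch_size) // stride + 1


def patch_origins(image_size: int, patch_size: int, stride: int) -> list:
    """Top-left (row, col) corner of every patch in a square image."""
    n = patches_per_side(image_size, patch_size, stride)
    return [(r * stride, c * stride) for r in range(n) for c in range(n)]


IMAGE_SIZE = 112  # normalized image side, as stated in the abstract
PATCH_SIZE = 28   # patch side of the best-scoring configuration
STRIDE = 14       # half-patch stride, so neighbouring patches overlap by 50%

overlapping = patch_origins(IMAGE_SIZE, PATCH_SIZE, STRIDE)
non_overlapping = patch_origins(IMAGE_SIZE, PATCH_SIZE, PATCH_SIZE)

# Fractions of the image covered by the patch side, patch area, and stride.
side_fraction = PATCH_SIZE / IMAGE_SIZE
area_fraction = (PATCH_SIZE * PATCH_SIZE) / (IMAGE_SIZE * IMAGE_SIZE)
stride_fraction = STRIDE / IMAGE_SIZE

print(len(overlapping), len(non_overlapping))
print(side_fraction, area_fraction, stride_fraction)
```

Under these assumptions, the half-patch stride gives a 7x7 grid of 49 patches, against a 4x4 grid of 16 patches for the non-overlapping layout, so the transformer sees roughly three times as many tokens per image. The 28-pixel patch side is 25% of the 112-pixel image side (each patch covers 6.25% of the image area), and the 14-pixel stride is 12.5% of the side.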
