Age-Defying Face Recognition with Transformer-Enhanced Loss
Pritesh Prakash, Anoop Kumar Rai
TL;DR
This work tackles aging-induced variability in face recognition by augmenting a standard metric-loss framework with a transformer-based auxiliary loss that operates on the CNN’s final feature maps. The method derives two embeddings from the backbone output, a conventional embedding and a transformer-derived embedding, and trains them with a weighted combination of their losses to encourage age-invariant representations. On LFW and the age-diverse datasets CA-LFW and AgeDB, the transformer-enhanced loss improves performance over several angular-margin losses. Ablations show that the approach reaches results competitive with the state of the art while separating genuine and impostor pairs more clearly. The study demonstrates that transformers can provide meaningful contextual guidance in vision tasks without replacing the backbone, suggesting avenues for adaptive transformer losses and broader applications in long-range feature modeling.
Abstract
Aging presents a significant challenge in face recognition: changes in skin texture and tone alter facial features over time, making it particularly difficult to compare images of the same individual taken years apart, as in long-term identification scenarios. Transformer networks can capture sequential spatial relationships affected by aging. This paper presents a loss-evaluation technique that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates a transformer loss with a metric loss. This research analyzes how the transformer behaves on the convolution output when the CNN output is arranged as a sequence of vectors. These sequential vectors have the potential to overcome aging-related changes in texture and regional structure, such as wrinkles or sagging skin. The transformer encoder takes as input the contextual vectors obtained from the final convolution layer of the network. The learned features can be more age-invariant, complementing the discriminative power of the standard metric-loss embedding. We pair the transformer loss with various base metric-loss functions to evaluate the effect of the combined loss. We observe that this configuration allows the network to achieve state-of-the-art (SoTA) results on LFW and on the age-variant datasets CA-LFW and AgeDB. This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.
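To make the data flow described above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The function names, the parameter-free self-attention used as a stand-in for a learned transformer encoder, the mean pooling, the additive form `metric_loss + weight * transformer_loss`, and the default weight of 0.5 are all hypothetical choices made for this example.

```python
import math


def feature_map_to_tokens(fmap):
    """Rearrange a C x H x W feature map (nested lists) into H*W tokens of length C.

    Each spatial position of the final convolution output becomes one token,
    read in row-major order, so the map can be treated as a sequence.
    """
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention.

    Assumption: this stands in for a learned transformer encoder layer, which
    would also have query/key/value projections, multiple heads, and an MLP.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def mean_pool(tokens):
    """Average the tokens into one vector (a hypothetical pooling choice)."""
    count = len(tokens)
    return [sum(token[i] for token in tokens) / count for i in range(len(tokens[0]))]


def combined_loss(metric_loss, transformer_loss, weight=0.5):
    """Weighted additive combination; the form and the default weight are assumptions."""
    return metric_loss + weight * transformer_loss
```

In a training loop, the conventional embedding and the pooled output of the transformer branch would each feed a metric loss (for example, an angular-margin loss), and the two scalar losses would be passed to `combined_loss`. Only the backbone and its conventional embedding need to be kept at inference time if the transformer branch acts purely as a training signal, which is how the TL;DR's "without replacing the backbone" is read here.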
