Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation
Gaby Maroun, Salah Eddine Bekhouche, Fadi Dornaika
TL;DR
This work addresses facial age estimation by proposing a ConvNeXt-Transformer hybrid that fuses CNN-based local feature extraction with Transformer-based global context. The architecture passes feature maps produced by ConvNeXt through a ViT-style encoder and uses an MLP head for regression. Training proceeds in two stages: ImageNet pretraining, followed by fine-tuning on MORPH II, CACD, AFAD, and IMDB-Clean. Ablations covering two ConvNeXt variants with linear heads, ViT-only baselines, and the combined hybrid report competitive MAE and strong CS@5 performance across datasets. The findings highlight the potential of hybrid CNN-Transformer models to capture both localized aging cues and dispersed age-related patterns, offering a robust path toward more accurate and transferable facial age estimation systems.
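The data flow described above (CNN feature maps, then a ViT-style encoder, then an MLP regression head) can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `HybridAgeRegressor`, the small two-layer convolutional stem standing in for the pretrained ConvNeXt backbone, the class token, the learned positional embedding, and every hyperparameter are assumptions chosen to keep the example small and self-contained.

```python
import torch
import torch.nn as nn


class HybridAgeRegressor(nn.Module):
    """CNN feature maps -> token sequence -> Transformer encoder -> MLP head."""

    def __init__(self, embed_dim=64, depth=2, num_heads=4, grid=7):
        super().__init__()
        # Stand-in for the ConvNeXt backbone (assumption): two strided
        # convolutions mapping 3x112x112 images to embed_dim x 7 x 7 maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim // 2, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=4, stride=4),
        )
        # Class token and learned positional embedding (assumptions).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=4 * embed_dim,
            activation="gelu",
            batch_first=True,
            norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=depth, enable_nested_tensor=False
        )
        # MLP head that regresses a single scalar age per image.
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, images):
        feats = self.backbone(images)              # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per location
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)             # global self-attention over locations
        return self.head(encoded[:, 0]).squeeze(-1)  # (B,) predicted ages


model = HybridAgeRegressor().eval()
with torch.no_grad():
    ages = model(torch.randn(2, 3, 112, 112))
print(ages.shape)
```

The key step is the reshape in `forward`: each spatial location of the convolutional feature map becomes one token, so self-attention can relate distant facial regions that the convolutions only see locally. In a setup closer to the one described, the stem would be replaced by a pretrained ConvNeXt feature extractor and the whole model fine-tuned on the age datasets.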
Abstract
Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNN's localized feature extraction capabilities and the Transformer's global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework for improving the model's focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the potential of hybrid architectures and represents a promising direction for integrating CNNs and Transformers to address complex computer vision challenges.
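For readers unfamiliar with the evaluation metrics, the following short Python sketch shows how MAE and the cumulative score at a five-year threshold (CS@5) are conventionally computed for age estimation. The function names and the sample ages are illustrative assumptions, not values or code from the paper.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


def cumulative_score(true_ages, predicted_ages, threshold=5):
    """Percentage of samples whose absolute error is at most `threshold` years."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(t - p) <= threshold for t, p in zip(true_ages, predicted_ages))
    return 100.0 * hits / len(true_ages)


# Made-up ages for illustration only; absolute errors are 2, 4, 8, and 1 years.
true_ages = [23, 35, 47, 61]
predicted_ages = [25, 31, 55, 60]
print(mean_absolute_error(true_ages, predicted_ages))  # 3.75
print(cumulative_score(true_ages, predicted_ages))     # 75.0
```

Lower MAE is better, while higher CS@5 is better; the two are complementary because MAE is sensitive to a few large errors, whereas CS@5 only counts how many predictions fall within the tolerance.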
