Table of Contents
Fetching ...

Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation

Gaby Maroun, Salah Eddine Bekhouche, Fadi Dornaika

TL;DR

This work addresses facial age estimation by proposing a novel ConvNeXt-Transformer hybrid that fuses CNN-based local feature extraction with Transformer-based global context. The architecture processes ConvNeXt-created feature maps through a ViT-style encoder and uses an MLP head for regression, with a two-stage training regime beginning from ImageNet pretraining and followed by fine-tuning on MORPH II, CACD, AFAD, and IMDB-Clean. Comprehensive ablations show the value of two linear-headed ConvNeXt variants, ViT-only baselines, and the combined hybrid, achieving competitive MAEs and strong CS@5 performance across datasets. The findings highlight the potential of hybrid CNN-Transformer models to better capture localized aging cues and dispersed age-related patterns, offering a robust path toward more accurate and transferable facial age estimation systems.

Abstract

Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNNs localized feature extraction capabilities and the Transformers global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.

Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation

TL;DR

This work addresses facial age estimation by proposing a novel ConvNeXt-Transformer hybrid that fuses CNN-based local feature extraction with Transformer-based global context. The architecture processes ConvNeXt-created feature maps through a ViT-style encoder and uses an MLP head for regression, with a two-stage training regime beginning from ImageNet pretraining and followed by fine-tuning on MORPH II, CACD, AFAD, and IMDB-Clean. Comprehensive ablations show the value of two linear-headed ConvNeXt variants, ViT-only baselines, and the combined hybrid, achieving competitive MAEs and strong CS@5 performance across datasets. The findings highlight the potential of hybrid CNN-Transformer models to better capture localized aging cues and dispersed age-related patterns, offering a robust path toward more accurate and transferable facial age estimation systems.

Abstract

Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNNs localized feature extraction capabilities and the Transformers global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: General structure of the proposed ConvNeXt-Transformer architecture.
  • Figure 2: Details of a ConvNeXt Block.
  • Figure 3: Training Loss vs. Epochs for Different Models. The plot shows the training loss values plotted against the number of epochs for three different models on the MORPH II album dataset: ConvNeXt with the MAE loss function, ViT with the MAE loss function, ConvNeXt-Transformers with both, the MAE loss function and the adaptive one.
  • Figure 4: Cumulative Distribution of Absolute Error for 3 Different Models on the 3 Different Albums (MOPRH2, CACD and AFAD). The cumulative distribution plots provide an insightful comparison of model performance, showcasing the spread and density of absolute errors across the dataset. The example shown here demonstrates the best degrees of accuracy and precision we achieved in each model, offering valuable insights into their respective capabilities on the different dataset we're working on.