A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark

Jakub Paplham; Vojtech Franc

TL;DR

This work identifies and addresses major inconsistencies in evaluating facial age estimation methods, showing that many reported gains from specialized loss functions or rearranged network components disappear when evaluation is standardized and larger pretraining data are used. By defining a reproducible intra- and cross-dataset protocol and isolating components, the authors demonstrate that data, preprocessing, and backbone choices have outsized impact on performance, while the incremental benefit of novel losses is often negligible. They propose using the FaRL backbone as a robust, scalable baseline and validate it across seven public datasets, underscoring the practical value of pretraining data scale over architectural tinkering. The study culminates in a unified benchmark with public code and data splits, offering a clear path for fair comparisons and more reliable, real-world applicability in facial age estimation.

Abstract

Comparing different age estimation methods poses a challenge due to the unreliability of published results stemming from inconsistencies in the benchmarking process. Previous studies have reported continuous performance improvements over the past decade using specialized methods; however, our findings challenge these claims. This paper identifies two trivial, yet persistent issues with the currently used evaluation protocol and describes how to resolve them. We offer an extensive comparative analysis for state-of-the-art facial age estimation methods. Surprisingly, we find that the performance differences between the methods are negligible compared to the effect of other factors, such as facial alignment, facial coverage, image resolution, model architecture, or the amount of data used for pretraining. We use the gained insights to propose using FaRL as the backbone model and demonstrate its effectiveness on all public datasets. We make the source code and exact data splits public on GitHub.

Paper Structure (38 sections, 3 figures, 7 tables)

Introduction
Contributions
Issues with Current Evaluation Practices
Data Splits
Pipeline Ablation
Evaluation Protocol
Intra-dataset performance
Cross-dataset performance
Comparative Method Analysis
Methodology
Datasets
Data Splits
Model Architecture & Weight Initialization
Training Details
Preprocessing
...and 23 more sections

Figures (3)

Figure 1: Mean Absolute Error (MAE) $\downarrow$ of age estimation methods on the MORPH dataset, as reported in the existing literature and measured by us, viewed over time. Random splitting remains the prevalent data splitting strategy. The consistent performance improvements over time are attributed in the literature to specialized loss functions for age estimation. Subject-exclusive (identity-disjoint) data splitting is rarely employed. With unified subject-exclusive data splitting and all factors except the loss function fixed, all evaluated methods yield comparable results, failing to achieve the performance gains promised by the random splitting.
Figure 2: Comparison of different alignment methods using the average face from the FG-NET dataset.
Figure 3: Comparison of different facial coverage levels using the average face from the FG-NET dataset.
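
The distinction drawn in the Figure 1 caption between random and subject-exclusive (identity-disjoint) splitting can be illustrated with a minimal Python sketch. This is not the authors' released code: the function names `subject_exclusive_split` and `mean_absolute_error`, the 20% test fraction, and the toy subject labels are all assumptions made for illustration. The idea is only that images are grouped by subject before splitting, so that no person contributes images to both the training and the test partition.

```python
import random
from collections import defaultdict


def subject_exclusive_split(subject_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no subject lands in both partitions.

    subject_ids[i] is the identity label of sample i (hypothetical input
    format; real datasets encode identity in their own ways).
    """
    # Group sample indices by the subject they belong to.
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    # Shuffle the subjects (not the images) reproducibly.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    # Reserve a fraction of the subjects, with all their images, for testing.
    n_test = max(1, round(len(subjects) * test_fraction))
    test = [i for s in subjects[:n_test] for i in by_subject[s]]
    train = [i for s in subjects[n_test:] for i in by_subject[s]]
    return sorted(train), sorted(test)


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy example: eight images of five subjects.
    subject_ids = ["a", "a", "b", "c", "c", "c", "d", "e"]
    train, test = subject_exclusive_split(subject_ids)
    train_subjects = {subject_ids[i] for i in train}
    test_subjects = {subject_ids[i] for i in test}
    print("shared subjects:", train_subjects & test_subjects)
    print("MAE:", mean_absolute_error([20, 30, 40], [22, 27, 40]))
```

A random split would instead shuffle the image indices directly, which lets images of the same person appear on both sides of the split. The caption suggests that this is where the gains reported under random splitting come from.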
