Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures

Simiao Ren

TL;DR

This work asks whether zero-shot vision-language models (VLMs) can match or exceed task-specific age estimation architectures, and introduces the first large-scale cross-paradigm benchmark to answer it. It evaluates 34 models (22 specialized non-LLM models and 12 zero-shot VLMs) across 8 diverse datasets, using mean absolute error (MAE) as the primary metric alongside age-threshold and age-bin analyses. Zero-shot VLMs dominate the top ranks (average MAE 5.65 years) and are substantially safer for age verification: their false adult rate (FAR) on minors is 13–25%, against 60–100% for many non-LLM models. MiVOLO is the strongest non-LLM rival. Coarse age binning (8–9 classes) consistently hurts MAE, and unavailable pretrained weights create a reproducibility gap; the paper therefore calls for knowledge distillation and standardized model releases to advance practical, fair deployment.
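One way to see why coarse binning can hurt MAE: a model that outputs one of a few age classes has to map each class back to a single number, so even a perfect classifier carries a quantization error. The sketch below illustrates this under assumptions that are not from the paper: the bin edges in `HYPOTHETICAL_EDGES` are invented, each bin is decoded to its midpoint, and test ages are spread uniformly over 0 to 100. It does not reproduce the benchmark's numbers, and misclassified bins would add further error on top of this floor.

```python
# Illustrative only: bin edges, midpoint decoding, and the uniform age
# distribution are assumptions, not details taken from the paper.
HYPOTHETICAL_EDGES = [0, 3, 7, 14, 24, 37, 49, 59, 101]  # 8 coarse classes


def decode_to_midpoint(age, edges):
    """Return the midpoint of the bin [lo, hi) that contains `age`."""
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo <= age < hi:
            return (lo + hi) / 2
    raise ValueError(f"age {age} falls outside the bin edges")


def mean_absolute_error(true_ages, predicted_ages):
    """Mean of |true - predicted| over paired lists of ages."""
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# A "perfect" coarse classifier: always picks the correct bin.
ages = list(range(0, 101))
perfect_bin_predictions = [decode_to_midpoint(a, HYPOTHETICAL_EDGES) for a in ages]

# Even with every bin correct, the MAE cannot reach zero.
quantization_floor = mean_absolute_error(ages, perfect_bin_predictions)

# A regressor that predicts the exact age has no such floor.
exact_mae = mean_absolute_error(ages, ages)
```

Wide bins, such as the last one here, contribute most of the floor, which is consistent with coarse-bin models struggling at older ages, though the paper's own analysis may attribute the degradation to other factors as well.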

Abstract

Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across 8 standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini 3 Flash Preview, MAE 4.32) outperforms the best non-LLM model (MiVOLO, MAE 5.10) by 15%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60–100% false adult rates on minors while VLMs achieve 13–25%, and demonstrate that coarse age binning (8–9 classes) consistently degrades MAE to beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (<5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect toward distilling VLM capabilities into efficient specialized models.
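As a reading aid, the headline comparison can be checked with a few lines of arithmetic. The sketch assumes the standard definition of MAE (the mean of absolute differences between predicted and true ages) and measures the gap relative to the non-LLM baseline. The function names and the toy ages are illustrative and do not come from the paper's code or data.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Standard MAE in years: mean of |true - predicted|."""
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def relative_improvement(baseline_mae, candidate_mae):
    """Fraction by which the candidate's MAE is lower than the baseline's."""
    return (baseline_mae - candidate_mae) / baseline_mae


# Toy example (made-up ages): errors are 2, 2 and 3 years, so MAE = 7/3.
toy_mae = mean_absolute_error([10, 20, 30], [12, 18, 33])

# MAE values quoted in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
best_gap = relative_improvement(5.10, 4.32)      # about 0.153, i.e. the quoted 15%

# Average MAE by paradigm: non-LLM 9.88, VLM 5.65.
average_gap = relative_improvement(9.88, 5.65)   # about 0.43
```

Measured this way, the quoted 15% is the reduction in MAE relative to MiVOLO; the same calculation on the paradigm averages gives a gap of roughly 43%.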

Paper Structure (23 sections, 6 figures, 5 tables)

This paper contains 23 sections, 6 figures, 5 tables.

Introduction
Related Work
  Traditional Age Estimation.
  Existing Benchmarks.
  VLMs for Facial Analysis.
  Age Verification and Regulatory Context.
Benchmark Design
  Models
  Datasets
  Evaluation Protocol
Results
  Overall Performance
  Why MiVOLO Stands Alone
  The Coarse Binning Problem
  Age Verification at the 18-Year Threshold
...and 8 more sections

Figures (6)

Figure 1: Complete MAE heatmap across all 34 models and 8 datasets. Models sorted by average MAE (top = best). Orange labels = VLM, blue labels = non-LLM. Darker cells indicate higher error. The top tier is dominated by VLMs with uniformly low errors across all datasets.
Figure 2: All 34 models ranked by average MAE. Orange = VLM, blue = non-LLM. VLMs occupy 12 of the top 13 positions.
Figure 3: Distribution of average MAE by model type. The VLM interquartile range falls entirely below the non-LLM median.
Figure 4: MAE by architecture category. Each dot is one model; black bars show category means. Coarse-bin models are uniformly poor. VLMs show the tightest cluster with lowest mean.
Figure 5: Error rates at the 18-year threshold. Left: False Adult Rate (minors missed); Right: False Minor Rate. Models sorted by FAR. VLM model names in orange.
...and 1 more figure
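Figure 5 reports two error rates at the 18-year threshold. A minimal sketch of how such rates could be computed follows, assuming that the False Adult Rate (FAR) is the share of true minors predicted to be 18 or older and that the False Minor Rate (FMR) is the share of true adults predicted to be under 18. These definitions, the function name, and the sample ages are assumptions for illustration; the paper's exact protocol may differ.

```python
def threshold_error_rates(true_ages, predicted_ages, threshold=18):
    """Return (false_adult_rate, false_minor_rate) at the given age threshold.

    false_adult_rate: fraction of true minors predicted at or above the threshold.
    false_minor_rate: fraction of true adults predicted below the threshold.
    A rate is None when its group (minors or adults) is empty.
    """
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have equal length")

    pairs = list(zip(true_ages, predicted_ages))
    minor_predictions = [p for t, p in pairs if t < threshold]
    adult_predictions = [p for t, p in pairs if t >= threshold]

    far = None
    if minor_predictions:
        far = sum(p >= threshold for p in minor_predictions) / len(minor_predictions)

    fmr = None
    if adult_predictions:
        fmr = sum(p < threshold for p in adult_predictions) / len(adult_predictions)

    return far, fmr


# Made-up sample: four minors and four adults.
true_ages = [12, 16, 17, 15, 25, 40, 19, 30]
predicted = [14, 19, 21, 17, 23, 38, 17, 33]

# Two of the four minors are predicted as adults; one of the four adults as a minor.
far, fmr = threshold_error_rates(true_ages, predicted)
```

The two rates are normalized by different groups, so a model can score well on MAE and still have a high FAR if its errors on minors skew upward.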