Out-of-the-Box Age Estimation through Facial Imagery: A Comprehensive Benchmark of Vision-Language Models vs. Out-of-the-Box Traditional Architectures
Simiao Ren
TL;DR
This work asks whether zero-shot vision-language models (VLMs) can match or exceed task-specific age estimation architectures, and introduces the first large-scale cross-paradigm benchmark to answer the question. It evaluates 34 models (22 specialized non-LLM models and 12 zero-shot VLMs) across 8 diverse datasets, using mean absolute error (MAE) as the primary metric alongside age-threshold and age-bin analyses. Zero-shot VLMs dominate the top ranks (average MAE 5.65 years) and substantially improve age-verification safety: their false adult rate (FAR) on minors is 13–25%, versus 60–100% for many non-LLM models. MiVOLO is the strongest non-LLM rival. Two further findings stand out: coarse age binning (8–9 classes) is detrimental to MAE, and unavailable pretrained weights create a reproducibility gap. The paper therefore calls for knowledge distillation and standardized weight releases to advance practical, fair deployment.
Abstract
Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating \textbf{34 models} -- 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs -- across \textbf{8 standard datasets} (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB), with 1{,}100 test images evaluated per model. Our key finding is striking: \emph{zero-shot VLMs significantly outperform most specialized models}, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best VLM (Gemini~3 Flash Preview, MAE~4.32) outperforms the best non-LLM model (MiVOLO, MAE~5.10) by 15\%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60--100\% false adult rates on minors while VLMs achieve 13--25\%, and demonstrate that coarse age binning (8--9 classes) consistently pushes MAE above 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages ($<$5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect its efforts toward distilling VLM capabilities into efficient specialized models.
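
The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's evaluation code: the function names are hypothetical, and the false adult rate is assumed here to mean the fraction of true minors (age below 18) whose predicted age is 18 or above, a definition the abstract does not spell out.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float], pred_ages: Sequence[float]) -> float:
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(pred_ages) or len(true_ages) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, pred_ages)) / len(true_ages)


def false_adult_rate(
    true_ages: Sequence[float],
    pred_ages: Sequence[float],
    threshold: float = 18.0,
) -> float:
    """Fraction of true minors predicted as adults (assumed definition)."""
    # Keep only the predictions made for subjects who are actually minors.
    minor_preds = [p for t, p in zip(true_ages, pred_ages) if t < threshold]
    if not minor_preds:
        raise ValueError("no minors in the sample")
    return sum(1 for p in minor_preds if p >= threshold) / len(minor_preds)


def relative_improvement(baseline_mae: float, candidate_mae: float) -> float:
    """Relative MAE reduction of the candidate with respect to the baseline."""
    return (baseline_mae - candidate_mae) / baseline_mae


if __name__ == "__main__":
    # Toy data for illustration only; these are not benchmark results.
    true_ages = [10, 16, 17, 30, 70]
    pred_ages = [12, 19, 15, 28, 60]
    print(mean_absolute_error(true_ages, pred_ages))
    print(false_adult_rate(true_ages, pred_ages))
    # MAE values quoted in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
    print(round(relative_improvement(5.10, 4.32), 3))
```

With the MAE values quoted above, `relative_improvement(5.10, 4.32)` evaluates to about 0.153, which matches the reported 15\% margin.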
