VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das

Abstract

Human age estimation from facial images is a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art LVLMs for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics (MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy), we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings, highlighting their emergent capabilities for biometric age estimation. We also identify performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark for strict zero-shot inference without fine-tuning, positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction, and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
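The eight evaluation metrics named in the abstract are all standard regression measures. As a minimal sketch, assuming ground-truth and predicted ages are given as plain numeric lists (the function name and dictionary keys are illustrative, not from the paper's released code):

```python
import math

def age_metrics(y_true, y_pred):
    """Compute the eight metrics used in the benchmark from paired age lists."""
    n = len(y_true)
    err = [p - t for t, p in zip(y_true, y_pred)]        # signed errors
    mae = sum(abs(e) for e in err) / n                   # mean absolute error
    mse = sum(e * e for e in err) / n                    # mean squared error
    rmse = math.sqrt(mse)                                # root mean squared error
    mape = sum(abs(e) / t for e, t in zip(err, y_true)) / n * 100  # % error
    mbe = sum(err) / n                                   # mean bias error (sign matters)
    mu_t = sum(y_true) / n
    mu_p = sum(y_pred) / n
    ss_tot = sum((t - mu_t) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot                          # coefficient of determination
    var_t = ss_tot / n
    var_p = sum((p - mu_p) ** 2 for p in y_pred) / n
    cov = sum((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)) / n
    ccc = 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2) # concordance corr. coeff.
    acc5 = sum(1 for e in err if abs(e) <= 5) / n * 100  # ±5-year accuracy (%)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "MBE": mbe, "R2": r2, "CCC": ccc, "Acc±5": acc5}
```

Note that MBE, unlike MAE, preserves the sign of the error, so it reveals whether a model systematically over- or under-estimates age, and CCC penalizes both poor correlation and scale/location shifts between predictions and ground truth.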

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 1 table, and 1 algorithm.

Figures (6)

  • Figure 1: Pipeline for Age Estimation Using Large Vision-Language Models (LVLMs). Input images are processed by three LVLMs (LLaMA 3.2 Vision, GPT-4o, and Claude 3.5 Sonnet) to generate individual age predictions, which are then collected in a unified prediction table (CSV) and evaluated using identical performance metrics.
  • Figure 2: Zero-shot facial age estimation by LVLMs. (a) An input image and standardized prompt instruct the model to return a numeric age without explanation. (b) Predictions from LLaMA 3.2 Vision, GPT-4o, and Claude 3.5 Sonnet are shown, demonstrating their numeric outputs for the same input.
  • Figure 3: Model Performance Comparison on UTKFace Dataset
  • Figure 4: Model Performance Comparison on FG-NET Dataset
  • Figure 5: Radar chart visualization of model performance on UTKFace dataset, illustrating the relative advantages of each LVLM across different evaluation criteria.
  • ...and 1 more figure
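The pipeline summarized in Figures 1 and 2 (a standardized prompt requesting a bare numeric age, one prediction per model per image, collected into a unified CSV-style table) can be sketched as follows. This is a hypothetical reconstruction, not code from the paper: `query_fn` stands in for each provider's API call, and the prompt wording is an assumption based on the Figure 2 caption.

```python
import re

# Standardized zero-shot prompt (paraphrased from the Figure 2 description).
PROMPT = ("Estimate the age of the person in this image. "
          "Respond with a single number only, no explanation.")

def parse_age(reply):
    """Extract the first numeric token from a model reply; None if absent."""
    m = re.search(r"\d+(?:\.\d+)?", reply)
    return float(m.group()) if m else None

def benchmark(images, models, query_fn):
    """Collect one prediction per (model, image) pair into CSV-ready rows.

    images:   list of (image_path, true_age) pairs
    models:   list of model names, e.g. the three LVLMs in the paper
    query_fn: callable (model_name, image_path, prompt) -> reply text;
              a provider-specific implementation must be supplied
    """
    rows = []
    for name in models:
        for path, true_age in images:
            reply = query_fn(name, path, PROMPT)
            rows.append({"model": name, "image": path,
                         "true_age": true_age, "pred_age": parse_age(reply)})
    return rows
```

Parsing a bare number from the reply is the fragile step in practice: models occasionally wrap the age in text ("about 32 years old"), which is why a permissive regex and a `None` fallback for unparseable replies are useful before computing metrics.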