Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu, Haoyu Hu, Ziyou Wang, Finn Guo, Simiao Ren

TL;DR

This paper investigates whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults, and calls for adversarial robustness evaluation as a mandatory criterion for model selection.

Abstract

Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
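
The ACR definition above can be made concrete with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 18-year gate expressed as `predicted_age >= 18`, and the toy numbers are all hypothetical. It computes the fraction of baseline-minor predictions that flip to adult after an attack, along with the mean shift in predicted age over all subjects.

```python
from typing import Sequence

# Assumption: "adult" means predicted age >= 18. The source mentions an
# 18-year decision threshold but does not state how ties are handled.
ADULT_THRESHOLD = 18.0


def attack_conversion_rate(
    baseline: Sequence[float],
    attacked: Sequence[float],
    threshold: float = ADULT_THRESHOLD,
) -> float:
    """Fraction of images predicted minor at baseline that flip to adult."""
    if len(baseline) != len(attacked):
        raise ValueError("baseline and attacked must have the same length")
    # Only images the model already called "minor" are eligible to convert,
    # so the metric ignores how many baseline-adult images are in the set.
    eligible = [(b, a) for b, a in zip(baseline, attacked) if b < threshold]
    if not eligible:
        return float("nan")  # undefined when no baseline-minor predictions exist
    flipped = sum(1 for _, a in eligible if a >= threshold)
    return flipped / len(eligible)


def mean_age_shift(baseline: Sequence[float], attacked: Sequence[float]) -> float:
    """Average change in predicted age over all subjects (attacked - baseline)."""
    if len(baseline) != len(attacked) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a - b for b, a in zip(baseline, attacked)) / len(baseline)


# Toy predictions (hypothetical, not data from the paper).
baseline_pred = [12.0, 16.5, 17.2, 19.0]
attacked_pred = [15.0, 21.0, 18.0, 25.0]

acr = attack_conversion_rate(baseline_pred, attacked_pred)
shift = mean_age_shift(baseline_pred, attacked_pred)
```

In this toy case three images are predicted minor at baseline and two of them cross the gate, so the ACR is 2/3. The fourth image is already predicted adult and is excluded from the denominator, which is what makes the metric independent of the minor-to-adult ratio in the test set.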

Paper Structure (50 sections, 2 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Simulated cosmetic attacks on four subjects (ground-truth ages 11–15). Predicted age shown per condition (MiVOLO). Green badge = correctly classified as minor; red badge = bypassed (predicted adult). The all-four combination fools the model for all four subjects shown.
  • Figure 2: Individual prediction trajectories (baseline $\to$ attacked) for all 329 subjects, per attack type and model class. Orange lines: threshold crossers, i.e., subjects whose predicted age crosses the 18-year gate after attack (these are the events counted by ACR). Gray lines: non-crossers. Bold colored line: group mean trajectory. Red dashed line: 18-year decision threshold. ACR annotation gives the bypass count for each panel.
  • Figure 3: Attack effectiveness stratified by true age. Top: mean age shift per attack type and model class. Bottom: threshold-crossing rate (fraction of model-correctly-identified minors that cross the 18-year gate) per true age. Dashed red line at 50%. Younger children (ages 10–13) are paradoxically harder to bypass because the model's baseline prediction is further below the threshold; the most vulnerable age group is 15–17, where baseline predictions already cluster near 18.
  • Figure 4: Predicted age distributions before (grey) and after (colored) each attack, separately for CV-specialized and VLM models. Dashed red line = 18-year decision threshold; hatched region = bypass zone (predicted adult). VLM distributions shift rightward more uniformly; CV models show sharper concentration near the gate.
  • Figure 5: Distribution of individual age shifts ($\Delta$ predicted age) per attack, separated by model type. White dot = mean; ACR = Attack Conversion Rate (below x-axis). The beard attack shows the widest and most skewed distribution for specialized models; grey hair is more effective for VLMs. Makeup exhibits a distinctive bimodal distribution: most subjects shift by $\approx$0 years, while a subgroup near the threshold shifts upward substantially.
  • ...and 1 more figures