Evaluating LLM Safety Across Child Development Stages: A Simulated Agent Approach

Abhejay Murali, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar, Junfeng Jiao

TL;DR

ChildSafe addresses a gap in LLM safety evaluation by introducing developmentally grounded agents that simulate four child age ranges and by scoring model responses across nine safety dimensions with age-dependent weights. The approach combines linguistic validation against CHILDES, expert assessment, and a composite score $S_{composite}$ that quantifies safety across ages in multi-turn interactions. Experiments with four LLMs reveal age-dependent vulnerabilities and model-specific strengths, which points to the need for adaptive safety strategies rather than universal safeguards. The benchmark provides a reproducible framework, agent templates, and an experimental corpus to advance age-aware safety research. The authors acknowledge ethical and cultural limitations and advocate real-child studies and stakeholder engagement for responsible deployment.

Abstract

Large Language Models (LLMs) are rapidly becoming part of tools used by children; however, existing benchmarks fail to capture how these models manage language, reasoning, and safety needs that are specific to various ages. We present ChildSafe, a benchmark that evaluates LLM safety through simulated child agents that embody four developmental stages. These agents, grounded in developmental psychology, enable a systematic study of child safety without the ethical implications of involving real children. ChildSafe assesses responses across nine safety dimensions (including privacy, misinformation, and emotional support) using age-weighted scoring in both sensitive and neutral contexts. Multi-turn experiments with multiple LLMs uncover consistent vulnerabilities that vary by simulated age, exposing shortcomings in existing alignment practices. By releasing agent templates, evaluation protocols, and an experimental corpus, we provide a reproducible framework for age-aware safety research. We encourage the community to expand this work with real child-centered data and studies, advancing the development of LLMs that are genuinely safe and developmentally aligned.
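
To make the idea of age-weighted scoring concrete, the following is a minimal sketch. It assumes the composite score is a weighted mean of per-dimension scores, with weights that depend on the simulated age group. The exact form of $S_{composite}$ is not given in this summary, so the function name `composite_score`, the age-group labels, the weights, and the scores are all hypothetical illustrations, not the authors' method. Only three of the nine dimensions (those named in the abstract) are used.

```python
from typing import Dict


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension safety scores.

    ASSUMPTION: a weighted mean is one plausible form of an age-weighted
    composite; it is not taken from the paper's equation.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * scores[d] for d in weights) / total_weight


# HYPOTHETICAL weights: two illustrative age groups that emphasize
# different dimensions. Real weights would come from the benchmark.
HYPOTHETICAL_WEIGHTS = {
    "younger": {"privacy": 0.5, "misinformation": 0.25, "emotional_support": 0.25},
    "older": {"privacy": 0.25, "misinformation": 0.5, "emotional_support": 0.25},
}

# HYPOTHETICAL per-dimension scores in [0, 1] for one model response.
example_scores = {"privacy": 0.8, "misinformation": 0.6, "emotional_support": 1.0}

if __name__ == "__main__":
    for group, weights in HYPOTHETICAL_WEIGHTS.items():
        print(group, round(composite_score(example_scores, weights), 3))
```

Under this assumed form, the same response receives a different composite score for each age group, because each group weights the dimensions differently. This is the property that age-weighted scoring is meant to capture.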

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: ChildSafe Developmental Agent Characteristics
  • Figure 2: Composite safety scores across four leading LLMs evaluated on the ChildSafe framework. GPT-5 achieves the highest safety performance, followed by Claude Sonnet 4, with notable performance gaps observed across models.
  • Figure 3: Age-stratified safety performance reveals distinct model patterns: GPT-5 and Claude Sonnet 4 peak with middle childhood (A9-11), Gemini 2.5 Pro improves with age, while DeepSeek-V3.1 shows declining performance.