InvisibleBench: A Deployment Gate for Caregiving Relationship AI
Ali Madad
TL;DR
InvisibleBench introduces a longitudinal, five-dimension deployment gate for caregiving AI, extending beyond single-turn safety tests by evaluating 3-20+ turn interactions across Safety, Compliance, Trauma, Belonging, and Memory. It combines deterministic autofail criteria with an LLM-assisted judging framework and a three-tier (tiered) multi-turn design to detect subtle, time-evolving harms such as attachment, boundary creep, and regulatory violations. Across 68 evaluations of four frontier models, the study reveals universal gaps in crisis detection and large variance in regulatory compliance, arguing for deterministic crisis routing and hybrid architectures to mitigate deployment risk. The work provides open-scoped artifacts, scalable evaluation costs, and a practical path toward safer, deployment-ready caregiving AI with a transparent, community-extensible methodology.
Abstract
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.
