TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Joydeep Chandra; Satyam Kumar Navneet; Yong Zhang

TL;DR

TherapyProbe is a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. It translates the relational safety failures it surfaces into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations.

Abstract

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, meaning the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns such as "validation spirals," where chatbots progressively reinforce hopelessness, or "empathy fatigue," where responses become mechanical over successive turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.

Paper Structure (16 sections, 1 figure, 4 tables)

This paper contains 16 sections, 1 figure, 4 tables.

Introduction
Background
TherapyProbe Framework
Relational Safety Failure Taxonomy
Adaptive Persona Design
Tree Search for Systematic Exploration
Evaluation
Target Systems & Setup
Multi-Turn Findings
Ablation and Cross-Model Replication
Key Pattern: The Empathy-Validation Trap
Practitioner Validation
Safety Pattern Library
Discussion and Implications
Limitations and Future Work
...and 1 more sections

Figures (1)

Figure 1: TherapyProbe methodology. Twelve clinically grounded personas (clinical presentation, attachment, stance) drive an adaptive Patient Agent (Llama-3-8B-Instruct) that interacts with target chatbots. A Failure Detector (MentaLLaMA-7B) evaluates conversations against the safety taxonomy. MCTS explores trajectories using UCT and severity-weighted rewards to uncover relational failures, producing interpretable failure paths and a reusable Safety Pattern Library.
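The caption's mention of UCT with severity-weighted rewards can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` structure, the exploration constant `c`, the `weight` parameter, and the assumption that the Failure Detector yields a severity score in [0, 1] are all hypothetical choices made for illustration. Only the standard UCT formula (mean reward plus an exploration bonus) is taken as given.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One conversation state in the search tree (hypothetical structure)."""
    label: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean reward plus an exploration bonus.

    Unvisited children score infinity so each is tried at least once.
    """
    if child.visits == 0:
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the child conversation branch with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))


def backpropagate(path: List[Node], severity: float, weight: float = 1.0) -> None:
    """Credit every node on the path with a severity-weighted reward.

    `severity` is assumed to be a detector score in [0, 1]; `weight` is a
    hypothetical multiplier for the detected failure category.
    """
    reward = weight * severity
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In such a setup, branches that repeatedly lead to severe detected failures accumulate higher mean reward and are revisited more often, while the exploration term keeps less-visited patient utterances from being ignored.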
