Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles

Ryan Louie; Ananjan Nandi; William Fang; Cheng Chang; Emma Brunskill; Diyi Yang

TL;DR

Roleplay-doh is a human-LLM collaboration framework in which qualitative feedback elicited from domain experts is transformed into natural-language principles that guide an LLM-prompted AI patient used for counselor training. The authors add a principle-adherence pipeline that decomposes complex principles into yes/no questions and checks whether each question applies to the current context before refining the response, yielding 30% improvements in response quality and principle following. In a user study with 25 counseling experts and third-party judges, AI patients created through Roleplay-doh were rated as more authentic and more ready for training use than scenario-only baselines, and the principle-adherence components reduced awkward dialogue and non-adherence. The work points to a scalable approach for expert-guided simulation in sensitive domains and suggests applicability to other domain-specific roleplay scenarios, while acknowledging the limitations of text-based interaction and the ethical considerations involved.

Abstract

Recent works leverage LLMs to roleplay realistic social scenarios, aiding novices in practicing their social skills. However, simulating sensitive interactions, such as in mental health, is challenging. Privacy concerns restrict data access, and collecting expert feedback, although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain expert, which is transformed into a set of principles, or natural language rules, that govern an LLM-prompted roleplay. We apply this pipeline to enable senior mental health supporters to create customized AI patients as simulated practice partners for novice counselors. After uncovering issues in GPT-4 simulations not adhering to expert-defined principles, we also introduce a novel principle-adherence prompting pipeline which shows 30% improvements in response quality and principle following for the downstream task. Via a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that more faithfully resemble real patients, as judged by creators and third-party counselors. See our project website at https://roleplay-doh.github.io/ for code and data.

Paper Structure (42 sections, 24 figures, 11 tables)

Introduction
Related Work
Utility of Simulated Partners
Aligning Simulation with Domain Experts
Text Generation with LLMs
Designing for Simulated Roleplay
Initial Tool Design Rationale
Pilot Testing
O1: Defining "realistic" patient behavior is ambiguous
O2: 20% of responses produced by GPT-4 do not satisfy expert principles or dialogue conventions.
Roleplay-doh
Principle Elicitation
Generation with Principle-Adherence
User Study using Roleplay-doh
Creator Perceptions
...and 27 more sections

Figures (24)

Figure 1: Roleplay-doh empowers an expert counselor to create a customized AI patient intended for other novice counselors to use as a practice partner. While interacting with the AI patient, the expert counselor can provide qualitative feedback which is converted by an LLM into a principle, or a custom rule governing desired roleplay behavior. The AI patient references the updated expert-defined principles to generate its subsequent responses.
Figure 2: Principle-adherence prompting pipeline for mitigating errors in satisfying expert principles and dialogue conventions. In Stage 1, expert-defined principles are rewritten into several Yes/No questions, and the LLM generates additional questions that help ensure adherence to dialogue conventions such as coherence and consistency. In Stage 2, the LLM (a) evaluates whether each question is applicable to the context and answers the applicable principle-adherence questions; and (b) refines the response so that it ideally receives Yes on all questions.
Figure 3: Win/Tie/Loss for the Error Test Cases along Consistency with Context (M1), Principle Adherence (M3), and Overall. Pairwise preference evaluation results with [No Critique] as a baseline. Results obtained after majority voting.
Figure 4: Roleplay-doh allows users to chat with an AI patient, Provide Feedback as a Kudos/Critique/Rewrite, and Convert Feedback into Principles, which in turn shape the roleplay behavior.
Figure 5: Based on our simulation-based power analysis across 300 trials for our linear, mixed-effect model, we conclude that 80% power can be achieved with 5 third-party judges.
...and 19 more figures
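
The two-stage principle-adherence pipeline described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `LLM` callable type, and the prompt wording are all hypothetical, and the model is abstracted as any function that maps a prompt string to a completion string. The sketch also assumes a single refinement pass; the caption does not say whether the refinement is repeated.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical abstraction: any function from prompt text to completion text.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    question: str
    applicable: bool  # does the question apply to this dialogue context?
    satisfied: bool   # if applicable, does the response earn a "yes"?


def _is_yes(text: str) -> bool:
    """Treat any completion starting with 'yes' (case-insensitive) as Yes."""
    return text.strip().lower().startswith("yes")


def principles_to_questions(principles: List[str], llm: LLM) -> List[str]:
    """Stage 1: rewrite each principle into one or more yes/no questions."""
    questions: List[str] = []
    for principle in principles:
        # Assumed prompt wording, for illustration only.
        out = llm(f"REWRITE as yes/no questions, one per line.\nPrinciple: {principle}")
        questions.extend(line.strip() for line in out.splitlines() if line.strip())
    return questions


def evaluate(questions: List[str], context: str, response: str, llm: LLM) -> List[Verdict]:
    """Stage 2(a): check applicability, then adherence, for each question."""
    verdicts: List[Verdict] = []
    for q in questions:
        applicable = _is_yes(
            llm(f"APPLICABLE to this context? Answer yes or no.\nContext: {context}\nQuestion: {q}")
        )
        # Only ask about adherence when the question applies to the context.
        satisfied = applicable and _is_yes(
            llm(f"SATISFIED by this response? Answer yes or no.\nResponse: {response}\nQuestion: {q}")
        )
        verdicts.append(Verdict(q, applicable, satisfied))
    return verdicts


def refine(context: str, response: str, questions: List[str], llm: LLM) -> str:
    """Stage 2(b): rewrite the response once if any applicable question failed."""
    failed = [
        v.question
        for v in evaluate(questions, context, response, llm)
        if v.applicable and not v.satisfied
    ]
    if not failed:
        return response  # nothing to fix; keep the original response
    return llm(
        "REVISE the response so every question is answered yes.\n"
        f"Context: {context}\nResponse: {response}\nQuestions:\n" + "\n".join(failed)
    )
```

In use, `principles_to_questions` would be called once per set of expert principles, and `refine` once per candidate AI-patient response. Keeping the applicability check separate from the adherence check means a principle that is irrelevant to the current turn cannot trigger an unnecessary rewrite.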
