Effects of Different Prompts on the Quality of GPT-4 Responses to Dementia Care Questions
Zhuochun Li, Bo Xie, Robin Hilsabeck, Alyssa Aguirre, Ning Zou, Zhimeng Luo, Daqing He
TL;DR
This study investigates how prompt engineering affects GPT-4 responses to dementia-care questions. It tests 12 prompt combinations, formed by crossing 4 system prompts, 1 initialization prompt, and 3 task prompts, on 3 real-world caregiver posts. Two experienced dementia-care clinicians evaluated the responses using a 5-indicator quality scale and qualitative analysis. Key findings show that task prompts influence response length and structure (notably TP2 and TP3), while the system prompts and the initialization prompt have limited impact on quality; the clinicians rated the responses as generally high quality and found no hallucinations. The work highlights the trade-off between detail and usefulness in caregiver guidance and underscores the need for larger, caregiver-inclusive studies to optimize prompt design for dementia care and broader healthcare contexts.
Abstract
Evidence suggests that different prompts lead large language models (LLMs) to generate responses of varying quality. Yet little is known about the effects of prompts on response quality in healthcare domains. In this exploratory study, we address this gap, focusing on a specific healthcare domain: dementia caregiving. We first developed an innovative prompt template with three components: (1) system prompts (SPs) featuring 4 different roles; (2) an initialization prompt; and (3) task prompts (TPs) specifying 3 different levels of detail, yielding 4 x 1 x 3 = 12 prompt combinations. Next, we selected 3 social media posts containing complicated, real-world questions about the challenges dementia caregivers face in 3 areas: memory loss and confusion, aggression, and driving. We then entered each post into GPT-4 with each of our 12 prompts, generating 12 responses per post and 36 responses in total. We compared the word counts of the 36 responses to explore potential differences in response length. Two experienced dementia care clinicians on our team assessed response quality using a rating scale with 5 quality indicators: factual, interpretation, application, synthesis, and comprehensiveness (scoring range: 0-5; higher scores indicate higher quality).
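The factorial design described above can be illustrated with a minimal sketch that enumerates the prompt combinations and the resulting generation runs. The labels `SP1`-`SP4`, `IP`, and `TP1`-`TP3` are placeholders assumed for illustration, not the study's actual prompt text, and `word_count` is a simple whitespace-based count that may differ from the counting method the authors used. No GPT-4 call is made here.

```python
from itertools import product

# Placeholder labels (assumptions); the actual prompt wording is not reproduced.
SYSTEM_PROMPTS = ["SP1", "SP2", "SP3", "SP4"]   # 4 roles
INIT_PROMPTS = ["IP"]                            # 1 initialization prompt
TASK_PROMPTS = ["TP1", "TP2", "TP3"]             # 3 levels of detail
POST_TOPICS = ["memory loss and confusion", "aggression", "driving"]


def build_prompt_grid():
    """Cross every system prompt with the initialization prompt and every task prompt."""
    return list(product(SYSTEM_PROMPTS, INIT_PROMPTS, TASK_PROMPTS))


def build_runs():
    """Pair each post topic with each prompt combination (one response per pair)."""
    return [(topic, combo) for topic in POST_TOPICS for combo in build_prompt_grid()]


def word_count(text):
    """Whitespace-based word count (an assumed, simplified length measure)."""
    return len(text.split())


grid = build_prompt_grid()
runs = build_runs()
print(len(grid), "prompt combinations;", len(runs), "responses to generate")
```

Each entry in `runs` identifies one response to be generated and rated, so per-condition comparisons of length or quality scores can be grouped by the system-prompt or task-prompt label.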
