Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
TL;DR
This work evaluates how LLMs handle cancer-related patient questions that embed false presuppositions, a safety and reliability gap in medical AI. It introduces Cancer-Myth, an expert-verified adversarial dataset of 585 questions, together with a 150-question set without false presuppositions (Cancer-Myth-NFP), and a pipeline that couples LLM-based question generation with physician verification. Even frontier models struggle: none corrects the false presupposition more than 43% of the time. Prompting alone (e.g., with GEPA optimization) improves presupposition detection, but at the cost of false flags on Cancer-Myth-NFP and lower performance on other medical benchmarks. The study also finds that multi-agent collaboration offers limited benefit for mitigating false presuppositions, underscoring the need for robust safeguards and improved training for patient-centered, misinformation-resistant medical AI systems. Overall, Cancer-Myth provides a rigorous benchmark and a critical lens on the safety implications of real-world, patient-driven LLM usage in oncology.
Abstract
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions that include patient details. In this paper, we first have three hematology-oncology physicians evaluate LLM responses to cancer-related questions drawn from real patients. While the responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM, including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet, corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find that typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of wrongly flagging presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
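The three quantities reported in the abstract (correction rate on Cancer-Myth, false-flag rate on Cancer-Myth-NFP, and relative drop on other benchmarks) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the `Judgment` record, the function names, and the toy data below are all hypothetical, and the sketch assumes each response has already been judged as either flagging a false presupposition or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-question record (not the paper's schema)."""
    has_false_presupposition: bool  # True: Cancer-Myth item; False: NFP item
    flagged: bool                   # response corrected/flagged a presupposition


def _flag_rate(items: list[Judgment]) -> float:
    """Fraction of items whose response flagged a presupposition."""
    if not items:
        raise ValueError("no items to score")
    return sum(j.flagged for j in items) / len(items)


def correction_rate(judgments: list[Judgment]) -> float:
    """Share of false-presupposition questions that were corrected."""
    return _flag_rate([j for j in judgments if j.has_false_presupposition])


def false_flag_rate(judgments: list[Judgment]) -> float:
    """Share of presupposition-free questions that were wrongly flagged."""
    return _flag_rate([j for j in judgments if not j.has_false_presupposition])


def relative_drop(baseline: float, mitigated: float) -> float:
    """Relative performance drop of a mitigated model versus its baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - mitigated) / baseline


# Toy data only; these are not results from the paper.
toy = (
    [Judgment(True, True)] * 4 + [Judgment(True, False)]       # 5 myth items
    + [Judgment(False, True)] * 2 + [Judgment(False, False)] * 3  # 5 NFP items
)

if __name__ == "__main__":
    print(f"correction rate: {correction_rate(toy):.2f}")
    print(f"false-flag rate: {false_flag_rate(toy):.2f}")
    print(f"relative drop:   {relative_drop(0.70, 0.63):.2f}")
```

Reporting the correction rate and the false-flag rate side by side makes the trade-off described above visible: a prompt that raises the first can also raise the second.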
