Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia

TL;DR

This work addresses a safety and reliability gap in medical AI by evaluating how LLMs handle cancer-related patient questions that embed false presuppositions. It introduces Cancer-Myth, an expert-verified adversarial dataset of 585 questions, together with a 150-question set without false presuppositions (Cancer-Myth-NFP) and a generation and evaluation pipeline that couples LLM-based question generation with physician verification. Even frontier models struggle to correct false presuppositions: no model does so more than 43% of the time. Prompting alone (e.g., precautionary prompts optimized with GEPA) improves presupposition detection, but at the cost of false alarms on questions without false presuppositions and lower performance on other medical benchmarks. The study also finds that multi-agent collaboration offers limited benefit for mitigating false presuppositions, underscoring the need for robust safeguards and improved training for patient-centered, misinformation-resistant medical AI systems. Overall, Cancer-Myth provides a rigorous benchmark and a critical lens on the safety implications of real-world, patient-driven LLM usage in oncology.

Abstract

Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate LLM responses to cancer-related questions drawn from real patients. While the responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find that typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of wrongly flagging false presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
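The trade-off described in the abstract involves two rates: how often a model corrects a false presupposition on Cancer-Myth, and how often it wrongly flags one on Cancer-Myth-NFP. The following is a minimal sketch of how these two rates could be computed, assuming each response has already been reduced to a binary "flagged" judgment. The `Judgment` record, the function names, and the toy labels are hypothetical illustrations, not the paper's scoring code or results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One judged model response (hypothetical record layout)."""
    has_false_presupposition: bool  # True for Cancer-Myth items, False for Cancer-Myth-NFP items
    flagged: bool                   # True if the response claims the question has a false presupposition


def _flag_rate(judgments: List[Judgment], with_presupposition: bool) -> float:
    """Fraction of flagged responses among questions of the requested kind."""
    relevant = [j for j in judgments if j.has_false_presupposition == with_presupposition]
    if not relevant:
        return 0.0
    return sum(j.flagged for j in relevant) / len(relevant)


def correction_rate(judgments: List[Judgment]) -> float:
    """Share of false-presupposition questions whose response flagged the presupposition."""
    return _flag_rate(judgments, with_presupposition=True)


def false_alarm_rate(judgments: List[Judgment]) -> float:
    """Share of presupposition-free questions whose response flagged one anyway."""
    return _flag_rate(judgments, with_presupposition=False)


# Toy labels for illustration only; these are NOT results from the paper.
toy = (
    [Judgment(True, True)] * 3 + [Judgment(True, False)] * 1
    + [Judgment(False, True)] * 1 + [Judgment(False, False)] * 3
)
print(correction_rate(toy), false_alarm_rate(toy))
```

Reporting both rates side by side makes the cost of a precautionary prompt visible: a prompt that flags every question would maximize the first rate while also maximizing the second.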

Paper Structure

This paper contains 54 sections, 1 equation, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: While current LLM responses can offer helpful medical information, they often fail to address false presuppositions in patient questions, which may lead to delays in or avoidance of effective care. LLMs should also provide corrective information (highlighted in red) to help patients recognize and understand their misconceptions.
  • Figure 2: Number of paragraphs vs. harmful label (left) and average score across hematology-oncology physicians (right). We find that top LLMs generally perform well in answering real patient questions, though they can be overly generic at times.
  • Figure 3: We prompt an LLM generator to create patient questions with false presuppositions related to a myth, providing both valid and invalid examples. An LLM responder answers these questions, followed by a verification process where an LLM verifier checks if the answers identify and address the false presuppositions. Finally, hematology-oncology physicians verify the adversarial examples.
  • Figure 4: Example question and information to correct the false presuppositions per category. The proportion of each category in Cancer-Myth is indicated in parentheses.
  • Figure 5: (a) GPT-5 performs best, but no frontier LLM corrects the false presuppositions in the patient question more than 43% of the time; multi-agent medical collaboration does not prevent LLMs from ignoring false presuppositions. (b) Adversarial data generated by Gemini-1.5-Pro causes failures in GPT-4o, but data generated by GPT-4o affects Gemini-1.5-Pro less.
  • ...and 9 more figures
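
The generator, responder, and verifier loop described in the Figure 3 caption can be sketched as a simple filter. This is a rough illustration under stated assumptions: the three model calls are passed in as plain callables, the function and parameter names are hypothetical, and keeping only the questions whose answers fail the verifier is an assumed reading of "adversarial examples". The physician verification step is manual and is not modeled here.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str, str]  # (myth, question, answer)


def collect_adversarial_candidates(
    myths: List[str],
    generate: Callable[[str], str],                          # myth -> patient question (LLM generator)
    respond: Callable[[str], str],                           # question -> answer (LLM responder)
    addresses_presupposition: Callable[[str, str, str], bool],  # (myth, question, answer) -> verdict (LLM verifier)
) -> List[Candidate]:
    """Return questions whose answers did not address the false presupposition.

    The returned candidates would still need physician verification.
    """
    candidates: List[Candidate] = []
    for myth in myths:
        question = generate(myth)
        answer = respond(question)
        # Assumption: a question counts as adversarial when the verifier
        # judges that the answer left the false presupposition uncorrected.
        if not addresses_presupposition(myth, question, answer):
            candidates.append((myth, question, answer))
    return candidates


# Toy stand-ins for the three LLM roles, for illustration only.
def toy_generate(myth: str) -> str:
    return f"Given that {myth}, what should I do next?"


def toy_respond(question: str) -> str:
    if "myth-B" in question:
        return "That premise is incorrect."
    return "Here are some general steps."


def toy_verify(myth: str, question: str, answer: str) -> bool:
    return "incorrect" in answer


candidates = collect_adversarial_candidates(
    ["myth-A", "myth-B"], toy_generate, toy_respond, toy_verify
)
print(candidates)
```

Passing the model roles as callables keeps the sketch independent of any particular LLM API; in practice each callable would wrap a prompt to the corresponding model.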