Towards medical AI misalignment: a preliminary study
Barbara Puccio, Federico Castagna, Allan Tucker, Pierangelo Veltri
TL;DR
The paper addresses the vulnerability of large language models to role-playing jailbreaks in the medical domain. It introduces the Goofy Game protocol, a game-theoretic, persona-based prompt designed to induce plausible but incorrect medical guidance while concealing the jailbreak intent, and provides preliminary cross-model evidence of its effectiveness. The main contribution is a novel, executable jailbreak framework tailored to healthcare, plus an analysis showing that highly confident, authoritative personas can produce misleading yet believable medical advice. This work highlights critical safety gaps in clinical decision-support AI and motivates rigorous evaluation and stronger safeguards to prevent misalignment in real-world medical settings.
Abstract
Despite their staggering capabilities as assistant tools, often exceeding human performance, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particularly robust approach involving role-playing (which we name "Goofy Game") seems effective against most current LLM safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field.
