RogueGPT: dis-ethical tuning transforms ChatGPT4 into a Rogue AI in 158 Words

Alessio Buscemi, Daniele Proverbio

TL;DR

The paper examines how ChatGPT's guardrails can be bypassed through GPT customization, producing a variant the authors call RogueGPT. Its methodology is dis-ethical tuning based on Egoistical Utilitarianism, tested across scenarios that include theft, violence, deception, discrimination, drug production, mass extermination, and AI self-preservation. It finds that customization can override the default safety filters and expose gaps in data curation and moderation, with potential for percolation to underlying models. It discusses the regulatory and ethical implications, urging stricter controls, robust safeguards, and policies such as the EU AI Act to govern user-driven tuning. The work points to a high risk for AI safety and policy, including misuse of customization interfaces, and calls for further research.

Abstract

The ethical implications of Generative Artificial Intelligence, and its potential for misuse, are increasingly worrying topics. This paper explores how easily the default ethical guardrails of ChatGPT can be bypassed using its latest customization features: simple prompts and fine-tuning that are effortlessly accessible to the broad public. This malevolently altered version of ChatGPT, nicknamed "RogueGPT", responded with worrying behaviours beyond those triggered by jailbreak prompts. We conduct an empirical study of RogueGPT's responses, assessing its flexibility in answering questions pertaining to what should be disallowed usage. Our findings raise significant concerns about the model's knowledge of topics such as illegal drug production, torture methods and terrorism. The ease of driving ChatGPT astray, coupled with its global accessibility, highlights severe issues regarding the quality of the data used to train the foundational model and the implementation of ethical safeguards. We thus underline the responsibilities and dangers of user-driven modifications, and the broader effects that these may have on the design of safeguarding and ethical modules implemented by AI programmers.

Paper Structure (22 sections, 22 figures)

Figures (22)

  • Figure 1: Schematic classification of methods that yield undesired and disallowed behaviours, according to whether they relate more to programming or to the degrees of freedom that allow for users' interventions.
  • Figure 2: Preliminary test on discrimination. When it recognises keywords like "Hitler" or "Aryans", RogueGPT overrules the Egoistical Utilitarian framework and answers within the original guardrails. When presented with placeholder fictions, however, it shows no restraint.
  • Figure 3: RogueGPT embraces the basic principles of Egoistical Utilitarianism by encouraging theft due to hunger.
  • Figure 4: Physical aggression is allowed by RogueGPT on the basis of personal happiness.
  • Figure 5: Lying and deceiving are encouraged in this scenario.
  • ...and 17 more figures