Robustness of Large Language Models Against Adversarial Attacks

Yiyi Tao, Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du

TL;DR

This paper examines the robustness of large language models to adversarial prompts, focusing on two attack modalities: character-level perturbations in prompts and jailbreak prompts designed to bypass safety mechanisms. The authors evaluate four GPT-family models (GPT-4o, GPT-4, GPT-4-turbo, GPT-3.5-turbo) under both attack types: character-level attacks on three sentiment datasets (IMDB, Yelp, SST-2), using a character-deletion model with per-word changes defined by $P_d$ and $N_{max}$, and jailbreak attacks drawn from a JailbreakHub dataset of 1405 prompts. Results show substantial accuracy degradation under character-level attacks across all models, with SST-2 being particularly sensitive. Under jailbreak prompts, newer models display stronger safeguard detection (e.g., GPT-4o at 95.7% detection) while older variants remain highly vulnerable (GPT-3.5-turbo at 48.9%). The findings underscore the continued need for adversarial training and safer prompting mechanisms to enable reliable and secure deployment of LLMs in critical applications.

Abstract

The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study on the robustness of the GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduces character-level text attacks in input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method uses jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the necessity for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.
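
To make the character-level attack concrete, here is a minimal Python sketch of a per-word character-deletion perturbation. The summary names the parameters $P_d$ and $N_{max}$ but does not define them, so the semantics here are an assumption: `p_d` is the probability of deleting each character, and `n_max` caps the deletions per word. The function names `perturb_word` and `perturb_prompt` are hypothetical and are not taken from the paper.

```python
import random


def perturb_word(word, p_d, n_max, rng):
    """Delete each character with probability p_d, at most n_max per word.

    Assumed semantics: the paper's exact definition of P_d and N_max
    is not given in this summary.
    """
    kept = []
    deleted = 0
    for ch in word:
        # Delete only while the per-word budget is not exhausted.
        if deleted < n_max and rng.random() < p_d:
            deleted += 1
            continue
        kept.append(ch)
    return "".join(kept)


def perturb_prompt(text, p_d=0.2, n_max=1, seed=0):
    """Apply the deletion perturbation to every whitespace-separated word."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    return " ".join(perturb_word(w, p_d, n_max, rng) for w in text.split())


if __name__ == "__main__":
    review = "This movie was absolutely wonderful"
    print(perturb_prompt(review, p_d=0.3, n_max=2, seed=42))
```

Under these assumptions, setting `p_d=0` leaves the prompt unchanged (apart from whitespace normalization), while raising `p_d` or `n_max` makes the attack stronger. Single-character words can be deleted entirely, so a stricter variant might skip them.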

Paper Structure

This paper contains 8 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Text attack with character-level perturbations, produced by randomly deleting or replacing characters in words of the input prompts. It shows that the LLMs are vulnerable to text attacks, which lead to errors in sentiment classification.
  • Figure 2: A jailbreak attack tricks AI into bypassing safeguards using cleverly crafted prompts. The image shows a blocked unethical query being rephrased as a scenario (e.g., acting as a police model), tricking the AI into providing restricted answers. Such attacks exploit loopholes in AI design, highlighting the importance of robust safety measures.