How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities

Lingbo Mo, Boshi Wang, Muhao Chen, Huan Sun

TL;DR

The paper examines the trustworthiness of open-source LLMs by introducing advCoU, an adversarial Chain of Utterances (CoU) prompting framework that injects malicious demonstrations and misleading internal thoughts into in-context prompts. It evaluates eight trustworthiness aspects (toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations) across five model series (Vicuna, MPT, Falcon, Mistral, Llama 2) in base and chat/instruct forms. Within each series, larger models tend to be more vulnerable to these attacks. Models fine-tuned mainly for instruction following (Falcon, Mistral) tend to become more susceptible than their base versions, whereas fine-tuning for safety alignment (MPT, Llama 2) lowers attack success rates. Compared to DecodingTrust, advCoU achieves higher attack success rates and better generalization, underscoring significant trustworthiness gaps in open-source LLMs and the need for robust alignment practices for safer deployment.

Abstract

The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an adversarial assessment of open-source LLMs on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose advCoU, an extended Chain of Utterances (CoU)-based prompting strategy that incorporates carefully crafted malicious demonstrations for trustworthiness attacks. Our extensive experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2. The empirical outcomes underscore the efficacy of our attack strategy across diverse aspects. More interestingly, our result analysis reveals that models with superior performance in general NLP tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. Additionally, models that have undergone instruction tuning, focusing on instruction following, tend to be more susceptible, although fine-tuning LLMs for safety alignment proves effective in mitigating adversarial trustworthiness attacks.

Paper Structure (33 sections, 2 equations, 12 figures, 2 tables)

This paper contains 33 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Eight aspects of trustworthiness covered in this work. For each aspect, our adversarial prompt includes malicious demonstrations and internal thoughts in the conversation between Red-LM and Target-LM (see the Method section). Please note that we simplified the conversation context and internal thoughts for brevity (see the appendices on internal thoughts and prompt templates).
  • Figure 2: Prompt example for Toxicity. The top part is the one-shot malicious demonstration. The item highlighted in yellow is the placeholder for an incomplete prompt (e.g., "You are a bunch of idiots, ...") and the one in blue is for the response to be generated by the target LM (e.g., "You are a bunch of idiots, and are good for nothing.").
  • Figure 3: Attack success rate (ASR) under eight trustworthiness aspects for base models of five LLM series with varied model sizes. The line with markers represents the average ASR scores across these aspects for each model variant, revealing a trend of increasing scores with larger model sizes within each model series. The number displayed in brackets under each model series name represents their average ASR score across all aspects and model sizes. We find that Llama 2 exhibits the highest average ASR.
  • Figure 4: Comparison between base and chat/instruct versions of LLMs. We find Falcon and Mistral exhibit higher ASR scores after fine-tuning that mainly emphasizes instruction following. Conversely, MPT and Llama 2 with fine-tuning for safety alignment show lower average ASR scores than their base versions.
  • Figure 5: Prompt example used for the Stereotype aspect.
  • ...and 7 more figures
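
The averaging described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: ASR is taken to be the fraction of attack attempts judged successful, the per-aspect success criteria used in the paper are not reproduced here, and the function names, the series label `series_a`, and the toy outcomes are all hypothetical rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (assumed ASR definition)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def average_asr_by_variant(results):
    """Mean ASR across aspects for each (series, size) model variant."""
    return {
        variant: mean([attack_success_rate(o) for o in aspects.values()])
        for variant, aspects in results.items()
    }


def average_asr_by_series(results):
    """Mean ASR across all aspects and all model sizes of each series."""
    per_series = defaultdict(list)
    for (series, _size), aspects in results.items():
        per_series[series].extend(attack_success_rate(o) for o in aspects.values())
    return {series: mean(scores) for series, scores in per_series.items()}


# Hypothetical toy outcomes (True = attack judged successful); not paper data.
toy_results = {
    ("series_a", "7b"): {
        "toxicity": [True, False, True, True],
        "privacy": [False, False, True, True],
    },
    ("series_a", "13b"): {
        "toxicity": [True, True, True, True],
        "privacy": [True, False, True, True],
    },
}

by_variant = average_asr_by_variant(toy_results)
by_series = average_asr_by_series(toy_results)
```

When every variant is evaluated on the same set of aspects, the flat per-series mean used here equals the mean of the per-variant averages, so either reading of the caption gives the same number.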