Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning

Xiao Wang; Tianze Chen; Xianjun Yang; Qi Zhang; Xun Zhao; Dahua Lin

TL;DR

This work exposes security vulnerabilities in open-source base LLMs by showing that in-context learning demonstrations can steer these unaligned models to produce high-risk, malicious outputs. It introduces ICLMisuse, a framework built on four components (Harmful Sample Injection, Detailed Demonstrations, Restyled Outputs, and Diverse Demonstrations), and a five-dimension risk metric (REL, CLR, FAC, DEP, DTL) that quantifies output quality and harm. Across 7B–70B base models and multiple languages, the method yields risk levels rivaling those of malicious fine-tuning, with three demonstrations identified as optimal for maximizing risk, and it generalizes robustly across domains and languages. The findings stress the urgency of defense-oriented safeguards that mitigate misuse risks in base LLM deployments while preserving openness and research agility.

Abstract

The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, which are deliberately designed to align with ethical standards and human values. A prevalent assumption holds that the inherent instruction-following limitations of base LLMs serve as a safeguard against misuse; our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, we show that base LLMs can effectively interpret and execute malicious instructions. To systematically assess these risks, we introduce a novel set of risk evaluation metrics. Empirical results reveal that the outputs of base LLMs can exhibit risk levels on par with those of models fine-tuned for malicious purposes. This vulnerability requires neither specialized knowledge nor training and can be exploited by almost anyone, highlighting the substantial risk and the critical need for immediate attention to the security of base LLMs.

Paper Structure (34 sections, 12 figures, 5 tables)

This paper contains 34 sections, 12 figures, 5 tables.

Introduction
Background
In-context Learning
Methodology
In-Context Learning Misuse Potential
Harmful Sample Injection
Detailed Demonstrations
Restyled Outputs
Diverse Demonstrations
Fine-grained Toxicity Evaluation Metrics
Experiments
Setup
Dataset
Models
Baselines
...and 19 more sections

Figures (12)

Figure 1: Comparison of different security attacks. Jailbreak and malicious fine-tuning attacks on aligned models often require significant human or hardware resources. Our ICLMisuse attack leverages base models and carefully designed demonstrations to achieve similar high-quality malicious outputs.
Figure 2: Comparison between our method and directly querying base LLMs. Direct queries typically result in unhelpful responses due to the model's inability to follow instructions accurately, whereas our approach, which incorporates harmful, restyled, detailed, and diverse demonstrations, leads to the generation of high-quality, harmful content.
Figure 3: The impact of demonstration quantity and composition on model performance across two model sizes, LLaMA2-7b and LLaMA2-13b. Sub-figures (a) and (b) explore the effect of the total number of demonstrations, while (c) and (d) focus on the influence of increasing the number of harmful demonstrations within a fixed total.
Figure 4: Average LLaMA2-7b risk scores by scenario.
Figure 5: ICLMisuse performance of the LLaMA2-7b and LLaMA2-13b models across English, Chinese, German, and French.
...and 7 more figures
