Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack

Yu Fu; Yufei Li; Wen Xiao; Cong Liu; Yue Dong

TL;DR

This paper shows that safety alignment in LLMs is uneven across NLP tasks, with summarization weakly aligned compared to translation and QA. Using safety-sensitive documents obtained via adversarial prompts, the authors construct a 6985-article dataset and evaluate multiple tasks (summarization, translation, QA, sentiment, etc.) across Llama2, Gemini, and GPT-4 to expose cross-task vulnerabilities. They demonstrate strong in-context attack effects with both single-task and compositional prompts, where a weakly aligned task placed first leads the model to process harmful content in a task it would otherwise refuse. The findings highlight a need to broaden safety alignment across conditional text generation tasks and warn that weakly aligned tasks can be exploited in multi-task contexts, with implications for future RLHF and safety benchmarking practices.

Abstract

Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety considerations? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with weaker safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini and GPT-4, indicating an urgent need for strengthening safety alignment across a broad spectrum of NLP tasks.

Paper Structure (33 sections, 14 figures, 7 tables)

Introduction
Dataset Creation
Safety Sensitive Documents Definition
Full Dataset
Diagnostic Datasets
Subset 1: Diverse Topic Subset
Subset 2: Most and Least Harmful Subsets
Safety Alignment of NLP Tasks
Experiment Settings
Datasets
NLP Task Prompts
Safety Alignment Across NLP Tasks
Task Process Rate
Task Output Harmfulness
Gemini Results
...and 18 more sections

Figures (14)

Figure 1: When given a direct translation task, the Llama2-7B model detects harmful content and refuses to respond. However, if summarization precedes translation in an in-context attack, it provides a translation. '[INST]' denotes input and '[/INST]' the output. See the appendix for more examples.
Figure 2: Safety alignment in performing NLP tasks on safety-sensitive documents, measured by average task process rates. Datasets and tasks are sorted by average task process rate. Darker colors indicate higher process rates on the safety-sensitive documents, and therefore weaker safety alignment of the NLP task.
Figure 3: Details of the prompts for all NLP tasks. [Article] represents a long harmful document from our datasets. For the Case tasks, we first lowercase ([Article].lower()) all the tokens of the prompt.
Figure 4: Task output harmfulness scores reveal that summarization, case switch, and translation tasks yield the highest scores, indicating that models closely follow prompts and retain much of the source content. Manual checks confirmed that models generally adhere to task descriptions; examples are given in the appendix.
Figure 5: The task process rate for different NLP tasks under different document length setups.
...and 9 more figures
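
The task process rate referenced in Figures 2 and 5 can be read as the fraction of safety-sensitive documents that a model processes rather than refuses, computed per task. The following is a minimal sketch of that bookkeeping only. The keyword-based refusal check, the marker list, the 80-character window, and the names `is_processed` and `task_process_rates` are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical refusal markers (assumption): the source does not say
# how refusals are detected, so this keyword list is illustrative only.
REFUSAL_MARKERS = ("i cannot", "i can't", "i apologize", "i'm sorry")


def is_processed(response: str) -> bool:
    """Return True when no refusal marker appears in the first 80 characters."""
    head = response.strip().lower()[:80]
    return not any(marker in head for marker in REFUSAL_MARKERS)


def task_process_rates(records):
    """Map each task to the fraction of its responses that were processed.

    records: iterable of (task_name, model_response) pairs.
    """
    total = defaultdict(int)
    processed = defaultdict(int)
    for task, response in records:
        total[task] += 1
        processed[task] += is_processed(response)  # bool counts as 0 or 1
    return {task: processed[task] / total[task] for task in total}
```

Under this reading, a higher rate for a task means the model refuses less often on that task, which corresponds to the darker cells in Figure 2.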
