Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong
TL;DR
This paper shows that safety alignment in LLMs is uneven across NLP tasks: summarization is weakly aligned compared with translation and QA. Using safety-sensitive documents obtained via adversarial prompts, the authors construct a 6985-article dataset and evaluate multiple tasks (summarization, translation, QA, sentiment, etc.) on Llama2, Gemini, and GPT-4 to expose cross-task vulnerabilities. They demonstrate strong in-context attack effects with both single-task and compositional prompts, in which a weakly aligned task such as summarization leads the model to process harmful content in tasks that are otherwise more strongly aligned. The findings highlight the need to broaden safety alignment across conditional text generation tasks and warn that weakly aligned tasks can be exploited in multi-task settings, with implications for future RLHF and safety benchmarking practices.
Abstract
Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety considerations? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with weaker safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini, and GPT-4, indicating an urgent need to strengthen safety alignment across a broad spectrum of NLP tasks.
