LongSafety: Evaluating Long-Context Safety of Large Language Models

Yida Lu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang

TL;DR

LongSafety addresses the gap in safety evaluation for open-ended long-context tasks by introducing a benchmark of 1,543 long-context instances (averaging about 5,424 words each) spanning 7 safety issues and 6 task types. It pairs the benchmark with a multi-agent safety evaluator (risk analyzer, context summarizer, safety judge) that achieves 92% accuracy on a test set and is used to judge the responses of 16 LLMs. Most models score a long-context safety rate (SR_long) below 55%, and strong short-context safety does not guarantee long-context safety, with risks tied to generation tasks and sensitive topics being especially challenging. The work also shows that relevant contextual content and longer inputs exacerbate safety risks, and it provides data, metrics, and methodology to guide future improvements in long-context safety, including scalable data collection and specialized evaluators.
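As a rough illustration of the headline metric, the sketch below assumes that a safety rate is simply the fraction of responses the evaluator labels safe, and that it can be broken down by task type. The record layout (`task`, `safe`) and both function names are hypothetical, chosen for this example only; they are not taken from the LongSafety code base.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def safety_rate(labels: Iterable[bool]) -> float:
    """Fraction of judgments that are safe (assumed definition of a safety rate)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no judgments to aggregate")
    return sum(labels) / len(labels)


def safety_rate_by_group(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Group hypothetical judgment records by `key` and compute a rate per group."""
    groups: Dict[str, List[bool]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(bool(record["safe"]))
    return {group: safety_rate(vals) for group, vals in groups.items()}


# Toy judgments, invented for illustration (not benchmark data).
records = [
    {"task": "QA", "safe": True},
    {"task": "QA", "safe": False},
    {"task": "GEN", "safe": False},
    {"task": "GEN", "safe": False},
]

overall = safety_rate(r["safe"] for r in records)
per_task = safety_rate_by_group(records, "task")
```

Under this assumed definition, a model with SR_long below 55% would be one for which fewer than 55 of every 100 long-context responses are judged safe.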

Abstract

As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both the evaluation and the improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation of 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data are available at https://github.com/thu-coai/LongSafety.

Paper Structure

This paper contains 54 sections, 5 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Comparison between short-context and long-context safety tasks. Long-context tasks are characterized by incorporating long contexts with instructions in contrast to short-context tasks (Upper), and a performance misalignment is observed between the two tasks for models in the red circle, as these points notably deviate from the blue diagonal arrow, indicating lower safety rates in long-context tasks (Lower).
  • Figure 2: Overall framework of LongSafety. The left section illustrates the construction pipeline for collecting contexts and instructions relevant to safety scenarios. The middle section provides an overview of LongSafety and presents the taxonomy of safety issues and task types. The right section depicts the collaborative workflow of the multi-agent evaluator responsible for assigning safety labels to model responses.
  • Figure 3: The average safety rate of all models within each task type. QA stands for Question Answering, GEN for Generation, BS for Brainstorming, SUM for Summarization, RW for Rewrite, RP for Role-playing.
  • Figure 4: The safety rate in four content settings.
  • Figure 5: The safety rate in varied length settings.
  • ...and 8 more figures